Abstract

The  purpose  of  this  research  is  to  explore  methods  used  to  parallelize  NP-complete  problems 
and  the  degree  of  improvement  that  can  be  realized  using  different  methods  of  load  balancing. 

A  serial  and  four  parallel  A*  branch  and  bound  algorithms  were  implemented  and  executed 
on  an  Intel  iPSC/2  hypercube  computer.  One  parallel  algorithm  used  a  global,  or  centralized,  list 
to  store  unfinished  work  and  the  other  three  parallel  algorithms  used  a  distributed  list  to  store 
unfinished  work  locally  on  each  processor. 

The three distributed list algorithms are: without load balancing, with load balancing, and
with load balancing and work distribution. The difference between load balancing and work
distribution is that load balancing occurs only when a processor becomes idle, whereas work
distribution attempts to emulate the global list of unfinished work by sharing work throughout
the algorithm, not just at the end. Factors which affect when and how often to load balance are
also investigated.

Which algorithm performed best depended on how many processors were used to solve the
problem. For a small number of processors, 16 or fewer, the centralized list algorithm easily
outperformed all others. However, beyond 16 processors, the overhead of all processors trying to
communicate and request work from the same centralized list began to outweigh any benefits of
having a global list. At that point the distributed list algorithms began to perform best. When
using 32 processors, the distributed list with load balancing and work distribution outperformed
the other algorithms.

IMPLEMENTATION AND ANALYSIS OF NP-COMPLETE ALGORITHMS ON
A DISTRIBUTED MEMORY COMPUTER


I. Introduction


1.1 General Problem Statement

The Department of Defense today is tasked with performing the same mission as ten years ago,
but with fewer personnel and less equipment. Sophisticated technology controlled by computers
allows the United States to continue its military leadership of the world. To continue this
leadership, algorithms must become more efficient as the tasks required of them become more
complex.

One of the most widely used problem solving techniques is exhaustive search, which searches
all possible answers and selects the best solution. But what happens if the answer to the overall
problem depends on the answer to many sub-problems within the main problem? Every possible
combination of answers must be investigated to find the best or optimal solution. Combinatorial
searches for small problems are possible, but the number of possible solutions which must be
checked can expand exponentially beyond our limits in time and memory space to search them.
Many real world problems in artificial intelligence, operations research, VLSI chip layout and
wire routing, and weapon to target assignment can use this exhaustive search technique.

Combinatorial searches whose execution times increase exponentially with a linear addition
of information to the problem are in the class of non-deterministic polynomial (NP) complete
problems. Examples of NP-complete problems include the knapsack problem, the traveling salesman
problem, the set covering problem, the assignment problem, and many others. This research
investigates elementary heuristics used to solve NP-complete problems on distributed memory
computers.

the search space, the example of 2 weapons and 10 targets requires 1024 bytes. However, when
there are 50 targets, the memory requirements grow to 1,130,000,000,000,000 bytes, or about
1,130 terabytes [Carpenter, 1986: 35]. This amount of memory is available on few computers.
Obviously, more efficient methods must be found for solving these problems. However, as is shown
in Chapter II, NP-complete problems can require polynomial space if properly designed.
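The growth quoted above can be checked with a short calculation. The sketch below assumes, since the original derivation is not reproduced here, that one byte is stored for every complete assignment and that each target receives one of two weapons, giving 2^targets states. The function name and its parameters are illustrative only.

```python
def search_space_bytes(options_per_target: int, targets: int, bytes_per_state: int = 1) -> int:
    """Bytes needed to store every complete assignment (assumed: one state per combination)."""
    return bytes_per_state * options_per_target ** targets


if __name__ == "__main__":
    # Each added group of 10 targets multiplies the storage by 1024.
    for targets in (10, 20, 30, 40, 50):
        print(f"{targets:2d} targets -> {search_space_bytes(2, targets):,} bytes")
```

Under these assumptions the 10-target case gives exactly 1024 bytes, and the 50-target case gives roughly 1.13 x 10^15 bytes, in line with the figures quoted from Carpenter.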

One method to shorten the time to find a solution is to accept a less than optimal solution.
Pearl provides heuristics, or guidelines, using an error function as a bound on the solution. This
allows any solution within a predetermined range to be accepted as a solution. Probabilistic
methods such as Monte Carlo algorithms control the search based on probabilities of finding a
solution down a certain path. This method can also return a less than optimal solution
[Pearl, 1984: 86-89]. Another non-optimal search method is the genetic algorithm. This algorithm
solves problems by manipulating strings of instructions the same way chromosomes manipulate DNA.
The process involves a complex search that combines blind groping with precise accounting
[Antonoff, 1991: 70]. Since this research only considers optimal solutions, none of these
techniques are investigated.

Another method to shorten the time to find a solution is to increase the computing power of
the computer. In the past, programmers used faster computers to solve NP-complete problems as
the problems increased in complexity. As Hennessy and Jouppi point out, the two most important
factors in the high growth rate in computing power are the dramatic increase in the number of
transistors available on a chip and architectural advances including the use of RISC ideas,
pipelining, and caches. With all these improvements, central processor unit (CPU) performance
has increased 95,000% since 1980 [Hennessy and Jouppi, 1991: 19-23].

However, traditional sequential computers are approaching the theoretical limit of the time
required to perform a computation. The limiting factor on computational speed is propagation
delay of signals between transistors on the same chip. This propagation delay consists of a gate
delay caused by the transistor itself and signal travel time between transistors. Gate delay has
been reduced to the point where signal travel time between transistors is the dominant delay.
This travel delay is being reduced by making the transistors smaller and placing the transistors
closer together on the chip. While the number of transistors on a chip can be quite high, there is
obviously a limit to the size and space required for each transistor. Since the speed of light
limits the time required for a signal to travel between transistors, other methods to improve
computation power are being sought [DeCegama, 1989: 23-27].

The design time for mainframes and minicomputers is approximately four to five years while
that of a microcomputer is approximately two years. According to Hennessy and Jouppi, this
shorter design time allows computers based on microprocessor technology to take full advantage
of the rapid changes in VLSI technology and changes in computer architecture. They show that
the computing power and speed of microcomputers is on par with mainframes and is quickly
approaching that of uniprocessor supercomputers [Hennessy and Jouppi, 1991: 19]. Bell contends
that computers built from microprocessors connected together to form a multiprocessor computer
are the trend of the future in computing [Bell, 1989: 1093-1097].

To obtain more computations in the same amount of time, multiprocessor computers are now
used. Each task is divided into sub-tasks, and each sub-task is assigned to one of the many
processors in the computer. By using multiprocessors, a solution to the task may be found in less
time than required on a sequential computer.

1.2 Scope

This research investigates the use of multiprocessor computers to solve NP-complete problems.
To do this, search algorithms are designed and implemented on an Intel iPSC/2 hypercube. Various
search strategies such as depth first search, breadth first search, backtracking, best first
search, and branch and bound are incorporated into different algorithms to determine their effect
on the algorithm's efficiency. Since this research is concerned only with optimal solutions,
non-optimal algorithms such as probabilistic or genetic algorithms are not considered. User
applications to test the search algorithm are also designed and implemented on the hypercube.

The goal of this research is to address the following:

1. Study and analyze previous works.

2. Investigate the effects of different static and dynamic load balancing techniques.

3. Investigate when and how to communicate global information.

4. Investigate the effects of keeping the list of work to be done in a centralized list on one master
processor or on distributed lists on many processors.

5. Determine the type of algorithm, or combination of algorithms, which best suits a particular
problem.

6. Develop appropriate performance metrics to evaluate each algorithm.

7. Investigate the amount of communication versus computation in each algorithm.

8. Investigate the underlying heuristics common to all NP-complete problems.

While  this  list  is  far  from  complete,  the  time  constraint  placed  upon  this  research  limits  the 
topics  which  can  be  investigated. 

Many different metrics are used to evaluate the time efficiency of a parallel search algorithm.
This thesis investigation bases performance of an algorithm primarily on speedup and number of
states generated by the algorithm. Other metrics considered include processor idle time,
efficiency, and the ratio of communication versus computation time. Metrics to measure space
efficiency are not considered due to the limited time for this research.

A comprehensive literature search covering the topics of search algorithms, parallel processing,
performance analysis, and hypercube computers provides the foundation for this research and is
discussed in Chapter II. Analyzing search algorithms developed at AFIT or stored at software
repositories around the country provides additional understanding of the problem.

Since  parallel  algorithms  are  especially  difficult  to  design  and  implement,  this  research  follows 
standard  software  engineering  practices  for  documenting,  testing,  and  designing  programs. 

Since the sequential algorithm is the standard against which the parallel algorithms are
initially measured, the first task in the research effort is designing and implementing a
sequential A* algorithm. The critical parameters of the program are then identified and measured,
which provides a baseline for comparison to later versions of the program. After implementing the
sequential algorithm on the parallel computer, baseline measurements of critical parameters are
again taken. Changes to the parallel program are made and the parameters again measured. After
each change to the program, data is collected and analyzed to determine the effect of the change
and to help determine what change to make next. All data is analyzed for fundamental heuristics
for solving NP-complete problems on parallel computers.

1.4 Summary of the Thesis

In this chapter, a working definition of NP-complete problems is provided, along with a quick
example to justify the need to study and improve the methods for solving them on a distributed
memory computer. The scope of the research is then presented.

The rest of the thesis is composed of five additional chapters. Chapter II is the literature
search to determine the current state of the art. Chapter III provides the high level design of
the algorithms and the measurement criteria. The low level design is provided in Chapter IV.
Chapter V discusses the results received from the different algorithms and Chapter VI presents my
conclusions and recommendations for future work.


This thesis assumes a general understanding of sequential and parallel computers along with
some understanding of the search techniques. A quick explanation of concepts and ideas is provided
in the following chapters and references are given for further study.


II. Background and Requirements


2.1  Introduction 

To perform this research an understanding of NP-complete problems, search techniques, and
parallel architectures is essential. Each of these topics is discussed in its own main section.
Also, a section discussing current topics in parallel search techniques is provided. Each of these
subjects has been extensively studied in books and journals, so only an overview of the subjects
is provided here. References are listed to provide a more in-depth study of each topic if desired.

2.2 NP-Complete

Brassard and Bratley define NP-complete problems in terms of two conditions. The first
condition is that the problem be a member of NP space. A problem is in NP space if it can be
solved on a non-deterministic Turing machine (NDTM) in polynomial time. Since an NDTM is a
computing model which can solve an infinite number of problems in parallel, it can solve both
polynomial and nonpolynomial time problems in polynomial time.

The second condition required for a problem to be NP-complete is that NP-complete problems
must be transformable to every other NP-complete problem in polynomial time [Brassard and
Bratley, 1985: 323-325]. Therefore, if a polynomial time solution is found to any of the
NP-complete problems, all can be solved in polynomial time. One goal of this research is to
investigate the heuristics which are common to parallel NP-complete problems.

Aho and others provide relationships between different classes of problems as shown in Figure
2.1. They also prove that P-space is identically equal to NP-space. Therefore, if a problem is in
NP-time, it is in P-space [Aho and others, 1974: 395]. This figure also shows the possibility that
other problems in NP-time are also NP-complete.


Figure 2.1. Space Time Relationships (figure not reproduced; surviving label: NP SPACE)


Aho lists the following as NP-complete problems:


1. Satisfiability - Is a Boolean expression satisfiable?

2. Clique - Does an undirected graph have a clique of size k?

3. Hamilton Circuit - Does an undirected graph have a Hamilton circuit?

4. Colorability - Is an undirected graph k-colorable?

5. Feedback Vertex Set - Does a directed graph have a feedback vertex set with k members?

6. Feedback Edge Set - Does a directed graph have a feedback edge set with k members?

7. Directed Hamilton Circuit - Does a directed graph have a directed Hamilton circuit?

8. Set Cover - Given a family of sets S_1, S_2, ..., S_n, does there exist a subfamily of k sets
S_{i_1}, S_{i_2}, ..., S_{i_k} such that

\bigcup_{j=1}^{k} S_{i_j} = \bigcup_{j=1}^{n} S_j ?

9. Exact Cover - Given a family of sets S_1, S_2, ..., S_n, does there exist a set cover consisting
of a subfamily of pairwise disjoint sets?

[Aho and others, 1971: 379]. See Aho or Christofides for a more detailed explanation of the above
listed problems [Christofides, 1974: 1-7')].
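To make the combinatorial nature of these problems concrete, the following is a minimal brute-force sketch of the Set Cover decision problem from item 8. The function name and the sample family are illustrative assumptions and are not taken from Aho. The search examines every subfamily of k sets, so the number of candidates grows combinatorially with the size of the family.

```python
from itertools import combinations


def has_set_cover(family, k):
    """Return True if some k sets of `family` have the same union as the whole family."""
    universe = set().union(*family)
    # Exhaustive search: try every subfamily of exactly k sets.
    return any(set().union(*subfamily) == universe
               for subfamily in combinations(family, k))


if __name__ == "__main__":
    sample = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]  # hypothetical instance
    print(has_set_cover(sample, 1))  # no single set covers {1, 2, 3, 4}
    print(has_set_cover(sample, 2))  # {1, 2} and {3, 4} together do
```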

The working definition used in this research is the class of problems for which the time
complexity has an exponential function as a lower bound. The Traveling Salesman Problem (TSP),
the Set Covering Problem (SCP), the assignment problem, and the Knapsack Problem are each a
subset of one of the above mentioned NP-complete problems and are themselves NP-complete. As
shown in Figure 2.1, the time complexity of these problems is O(c^n) and the space complexity is
O(n) [Jansen and Sijstermans, 1989: 271] [Brassard and Bratley, 1988: 324, 336-337].


2.3  Parallel  Architectures 


The most common computer in use today is a serial machine which physically performs one
task at a time. Each task must be performed in a definite order with one task following the other
in sequence. In contrast, a parallel computer can perform any number of different tasks at the
same time, limited only by the number of processors available to perform the tasks. One useful
comparison of differences in the implementations of an algorithm on serial and parallel machines is
speedup. According to Miller and Penke, speedup is defined as the ratio of the time the algorithm
takes to run on a serial computer to the time the algorithm takes to run on a parallel computer.
The formula for speedup is

S = T_{serial} / T_{parallel}

Another comparison between serial and parallel computers is in the area of efficiency.
Efficiency is defined as speedup per processor in the parallel system [Miller and Penke, 1989: 133].
Thus,

E = S / P

where P is the number of processors in the parallel system. Ideally, the speedup is a linear
function of P and the efficiency is constant. For reasons discussed later, this is very seldom the
case.
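The two metrics can be computed directly from measured run times. The sketch below is only an illustration of the definitions above; the function names and the timing values are hypothetical and are not measurements from this research.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S = T_serial / T_parallel."""
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """E = S / P, the speedup obtained per processor."""
    if processors < 1:
        raise ValueError("at least one processor is required")
    return speedup(t_serial, t_parallel) / processors


if __name__ == "__main__":
    # Hypothetical timings: 120 s serially, 10 s on 16 processors.
    print(speedup(120.0, 10.0))         # 12.0
    print(efficiency(120.0, 10.0, 16))  # 0.75
```

With ideal linear speedup the second value would be 1.0; anything lower reflects overhead such as communication or idle time.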

2.3.0.1 Types of Parallel Computers  One of the main differences among the categories
of parallel computers in use today is how their memories are organized. As described by DeCegama,
shared memory computers have one large block of memory which all the processors can access.
Communication between processors is accomplished by one processor placing information in a
memory location and other processors reading that location. The other system of memory storage
is a distributed memory computer. In this architecture, every processor has its own memory
which only it can access. Communication between processors is accomplished by passing messages
between the processors [DeCegama, 1989: 18-23 and 62-64]. This research concentrates only on the
distributed memory computer.

Another main difference among parallel computers is the communications network used to
pass information. One method is to connect all the processors and memory to a bus. This allows
a few lines to completely connect all processors and memory. The main disadvantage of a bus
architecture is the limited bandwidth of the bus. According to DeCegama,

". . . if the number of processors is large (from 50 to 100 processors with present
technology), the delays due to bus contentions for interprocessor communications and global
memory accesses are increasingly unacceptable, and performance degrades rapidly"

[DeCegama, 1989: 192].

The other communications network is a switching network of interconnecting lines. Processors
and memory cells are directly connected only to a fraction of the total number of processors
available. DeCegama provides an explanation for many types of switching networks including the
crossbar network, the wraparound mesh network, the shuffle-exchange network, the SW-banyan,
and the generalized cube [DeCegama, 1989: 199-253]. Messages between processors not directly
connected must be routed, or switched, by intermediate processors or switching units similar to
those used in telephone switching circuits.

DeCegama also categorizes parallel computers by the way instructions and data are processed.
Only the two most common categories, single instruction multiple data (SIMD) and multiple
instruction multiple data (MIMD), are discussed here [DeCegama, 1985: 63-65].

•  A SIMD computer has multiple processors with each processor performing the same
instruction on different data at the same time. All instructions are executed synchronously on all
processors.

•  A MIMD computer has multiple processors, each capable of asynchronously executing
different instructions on different data sets. The processors can work independently or as a
group.


2.3.1  Hypercube  Architecture  The  hypercube  is  a  distributed  memory  MIMD  computer. 
Each  processor  has  its  own  local  memory  and  information  is  disseminated  by  passing  messages 
between  processors.  Each  processor  can  work  independently  on  its  own  data,  or  work  as  a  group 
on  shared  data. 

Hayes and Mudge describe a hypercube architecture as a generalization of the 3 dimensional
cube graph to an arbitrary number of dimensions. Just as a 3 dimensional physical cube has 2^3 = 8
vertices, so an n dimensional hypercube has N = 2^n nodes. Each vertex has n nearest neighbors.
Figure 2.2 shows four examples of the connections of a hypercube for different values of n. This
topology guarantees that any 2 vertices are no more than n links apart. Therefore, a message
between any 2 vertices crosses at most log2 N links in the worst case [Hayes and Mudge, 1989:
1829-1830].
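These properties follow from labeling each node with an n-bit number so that neighbors differ in exactly one bit. The sketch below illustrates that labeling; the function names are hypothetical, and the dimension-order route shown is one simple way to reach a destination in the minimum number of links, not necessarily the routing used by the iPSC/2 hardware.

```python
def neighbors(node: int, dim: int) -> list[int]:
    """The dim nearest neighbors of a node: flip each address bit in turn."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hops(a: int, b: int) -> int:
    """Minimum links between two nodes: the number of differing address bits."""
    return bin(a ^ b).count("1")


def route(src: int, dst: int, dim: int) -> list[int]:
    """Dimension-order path: correct differing bits from lowest to highest."""
    path, current = [src], src
    for bit in range(dim):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path


if __name__ == "__main__":
    n = 3  # a 3-dimensional cube with 2**3 = 8 nodes
    print(neighbors(0, n))      # [1, 2, 4]
    print(route(0b000, 0b101, n))  # [0, 1, 5]
    print(max(hops(a, b) for a in range(2 ** n) for b in range(2 ** n)))  # 3
```

The last line confirms that the largest distance between any two nodes equals the dimension n = log2 N.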


2.3.2 Granularity  Grain size, or granularity, is used in parallel computers to describe the
relative size or frequency of an event as compared to other events of the same type. DeCegama
categorizes granularity into 2 broad areas: system and application granularity. System granularity
is used to describe attributes of the hardware and physical configuration while application
granularity describes the characteristics of the particular problem being solved [DeCegama, 1989:
8-9]. Both of these granularities influence the effectiveness of a parallel program.

2.3.2.1 System Granularity  System granularity is classified into three grain sizes:
coarse grain, medium grain, and fine grain. Generally, a coarse grain multi-processor computer
has a small number of large, complex processors while a fine grain multi-processor computer
has a large number of small, relatively simple processors. As the name implies, a medium grain
computer is in between coarse and fine grain in both the number of processors and the complexity
of the processors used. Most of the commercial parallel computers in use today are medium grain
[DeCegama, 1989: 8-9]. The definition of what is fine or coarse grain is subjective and constantly
changes as the technology changes.

2.3.2.2 Application Granularity  DeCegama divides application granularity into event
and task granularity. A task is defined as a program segment which must be executed sequentially.
Task granularity is the average amount of computation performed by each task of the program.
Event granularity is a measure of the average amount of computation performed by the processors
between events of a certain type. For example, communication granularity is the amount of
computation between message events. Other common event granularities are synchronization,
heuristic, and voting [DeCegama, 1989: 8-9].

Like system granularity, event granularity is also measured by relative comparisons between
the number of events. Coarse grain events have relatively large amounts of computation between
events, and fine grain events occur relatively frequently.

2.3.2.3 Granularity Tradeoffs  There are overhead costs associated with event
granularities. Processing the event, calls to the operating system or other functions,
communication between processors, and resource contention are all overhead which reduce the amount
of time spent solving the problem. Increasing the event granularity decreases these costs. However,
many algorithms require fine grain events to operate efficiently. For example, information
calculated by one processor might be required by all processors. Delaying the communication of the
data could result in longer execution times for the algorithm.

Increasing task granularity decreases the number of tasks in a computation. This decreases
the overhead associated with task creation and termination, but might introduce other costs. A
small number of large tasks might make it harder to balance the work load between the processors
because the tasks cannot be divided into smaller units of work. This could result in idle
processors and a longer execution time for an algorithm.

Whenever  designing  a  parallel  algorithm,  both  system  and  application  granularity  must  be 
considered.  Tradeoffs  between  event  grain  and  the  associated  overhead  costs  must  be  carefully 
weighed  to  obtain  the  optimal  performance  of  the  algorithm. 

2.4  Parallel  Search  Issues 

Architectures discussed previously are applicable to solving both serial and parallel search
algorithms. However, there are issues that pertain only to parallel computers. Communication of
global variables to all processors and distributing the work load evenly among the processors are
the main concerns addressed.

2.4.1 Global Variable Communication  Since this research covers distributed memory
computers only, each processor has access only to the information stored in its local memory. If a
global variable changes, the new value must be communicated to all the processors. Processor time
spent communicating subtracts from the time spent solving the problem. Jansen and Sijstermans
recommend partitioning the processors into groups in which only one master processor communicates
outside the group. This reduces the number of unnecessary communications between processors
because the master validates the new global data before transmitting it inside the group or to other
groups [Jansen and Sijstermans, 1989: 27o]. Another method is to transmit new global variables
to all processors at the same time. Still another method is to wait until certain control points in
the algorithm before transmitting global information. This last method reduces communication,
but at the cost of possibly expanding unnecessary states in the search graph associated with the
NP-complete problem.

2.4.2 Task Allocation Methods  There are two ways to allocate new tasks to the processors:
static and dynamic. Static allocation is done a priori. Dynamic allocation assigns processors to
a problem as new children or sub-problems are generated. NP-complete problem solutions are
probabilistic in that the order in which branches of the search graph are explored cannot be
determined. Also, the amount of work in each branch or sub-branch of the search graph cannot
be determined. Therefore, sub-branches generated by states, or sub-problems, are generated in an
unpredictable fashion. Static allocation of only certain branches of the search graph to particular
groups of processors would lead to processors which finish early being idle until the last processor
finishes. Dynamic allocation allows new sub-problems to be assigned to processors with few or no
problems waiting to be run. This allows all the processors to be active for approximately the same
length of time.
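The idle time caused by static allocation can be illustrated with a small simulation. The sketch below is a simplification under stated assumptions: the task costs are fixed and known to the simulation (in a real search they are not known in advance), static allocation is modeled as a round-robin assignment made before execution, and dynamic allocation is modeled as handing each task to whichever processor becomes free first. All names and cost values are hypothetical.

```python
import heapq


def static_makespan(costs, processors):
    """Assign task i to processor i mod P a priori; return the finishing time."""
    loads = [0] * processors
    for i, cost in enumerate(costs):
        loads[i % processors] += cost
    return max(loads)


def dynamic_makespan(costs, processors):
    """Give each task to the processor that becomes free earliest."""
    free_at = [0] * processors  # min-heap of times at which processors go idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


if __name__ == "__main__":
    costs = [8, 1, 1, 1, 8, 1, 1, 1]  # two large sub-problems among small ones
    print(static_makespan(costs, 4))   # 16: both large tasks land on one processor
    print(dynamic_makespan(costs, 4))  # 9: the large tasks run on different processors
```

In the static case three processors sit idle for most of the run while the fourth works through both large tasks.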

2.4.2.1 Centralized Versus Distributed List  Dynamic allocation has two methods for
allocating tasks. The first, centralized list (CL), keeps the listing of all the sub-problems
generated in one processor called the master. The processing of the sub-problems is done in all the
other processors, called slaves. The advantage of the CL is that the global "best" sub-problem is
always assigned to the next available processor. A disadvantage of the CL is that when a slave
finishes or generates a sub-problem, it must contact the master to insert the sub-problem or to
receive its next problem to work. This requires two communications for each initiation of a
sub-problem. According to Quinn, adding additional processors to the system causes a linear
increase in the communication overhead of the parallel algorithm. Also, the master is involved in
all of these communications and it can become a bottleneck for the system. When messages begin to
back up at the master, slaves become idle waiting for a response. Eventually, the communication
overhead to the master becomes the dominant computational factor and adding more processors to the
problem can actually increase the execution time of the algorithm [Quinn, 1990: 385]. The advantage
of always assigning the best sub-problem to a processor is greatly outweighed by the cost of
communication and waiting.
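The centralized list can be pictured as a single priority queue held by the master. The sketch below is a minimal single-process illustration, not the implementation used in this research: the class and method names are hypothetical, a lower bound value is assumed to mean a better sub-problem, and each call stands in for one message between a worker and the master.

```python
import heapq


class MasterList:
    """Global list of unfinished sub-problems kept on one master processor."""

    def __init__(self):
        self._heap = []    # (bound, state) pairs; smallest bound is served first
        self.messages = 0  # every insert or request costs one message

    def insert(self, bound, state):
        """A worker reports a newly generated sub-problem to the master."""
        self.messages += 1
        heapq.heappush(self._heap, (bound, state))

    def request(self):
        """A worker asks for work; the master returns the globally best sub-problem."""
        self.messages += 1
        return heapq.heappop(self._heap) if self._heap else None


if __name__ == "__main__":
    master = MasterList()
    for bound, state in [(7, "a"), (3, "b"), (5, "c")]:
        master.insert(bound, state)
    print(master.request())  # (3, 'b'), the best sub-problem seen so far
    print(master.messages)   # 4 messages: three inserts and one request
```

Because every insert and every request passes through the one master object, the message count grows with the number of sub-problems handled, which is the source of the bottleneck described above.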




The  other  approach  to  dynamic  allocation  of  sub-problems  is  the  distributed  list  (DL).  In 
this  approach,  each  processor  maintains  a  list  of  sub-problems  waiting  to  be  worked.  Therefore, 
when  the  processor  completes  or  generates  a  task,  all  communication  is  within  the  processor  and  no 
message  is  passed  to  another  processor.  This  eliminates  the  bottleneck  of  having  to  communicate 
twice  with  the  master  processor  when  a  task  is  completed. 

One problem with the DL method is that not all search problems generate the same number of
sub-problems. Therefore, to keep processors from being idle while sub-problems are still waiting
to be run on other processors, a method of load balancing must be implemented. Several different
load balancing algorithms and their performance are discussed by Quinn [Quinn, 1990: 385]. The
choice of which sub-problem to keep, which one to send to another processor, and when to
balance the loads greatly affects the efficiency of the search algorithm. While load balancing adds
to the communication overhead, the benefit of reduced processor idle time greatly outweighs the
extra cost in lost computation time due to communication [Ma and others, 1988: 1507-1510].
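
One possible load balancing step is sketched below. It is only an illustration under stated assumptions: each processor's list is modeled as a Python heap of node costs, the idle processor takes work from the processor with the longest list, and the donor keeps every other node in cost order so that both lists retain some good nodes. The function name balance and this sharing policy are hypothetical and are not the policies evaluated by Quinn or used in this research.

```python
import heapq


def balance(queues, idle):
    """Move work from the longest queue to the idle processor's queue.

    queues is a list of min-heaps of node costs, one per processor.
    Returns the number of nodes transferred.
    """
    donor = max(range(len(queues)), key=lambda p: len(queues[p]))
    if donor == idle or len(queues[donor]) < 2:
        return 0                      # nothing worth sharing
    ranked = sorted(queues[donor])    # best (lowest cost) first
    keep, send = ranked[0::2], ranked[1::2]
    queues[donor] = keep              # a sorted list is already a valid heap
    for node in send:
        heapq.heappush(queues[idle], node)
    return len(send)


queues = [[3, 5, 8, 9], [], [7]]
moved = balance(queues, idle=1)       # processor 1 has run out of work
```

The transfer itself is the communication cost mentioned above; the benefit is that processor 1 no longer sits idle.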

2.4.5 Limits on Speedup and Efficiency Many other factors can limit the speedup and efficiency
of a parallel algorithm. Since almost all programs have statements or routines which
don’t depend on values from other parts of the program, recognizing and efficiently exploiting this
parallelism is essential to produce the maximum speedup possible [Hayes and Mudge, 1989: 1834].
Another area that reduces efficiency is having processors idle until enough tasks are generated.
Changing the method of generating the tasks depending on where the algorithm is in the search
can minimize this problem. A final problem is that parts of the algorithm cannot be parallelized.
Starting, terminating, and certain other procedures in an algorithm are inherently sequential and
cannot be done in parallel. All these items decrease the speedup of an algorithm.

2.5 General Search Techniques


A  search  problem  can  be  represented  by  a  search  graph  with  the  root  of  the  graph  representing 
the  complete  problem  to  be  solved.  Each  child  node  represents  a  subproblem  of  the  parent  node 
and represents inclusion of one or more constraints to the problem [Quinn, 1990: 384] [Hayes and
Mudge, 1989: 1838]. To find an optimal solution, every node, or state, of the graph must be
explicitly  or  implicitly  checked  to  see  if  it  is  a  solution.  As  described  in  Chapter  I,  the  state  space 
can  be  extremely  large  making  it  impossible  to  explicitly  check  every  node.  The  main  difference 
between  the  search  techniques  described  below  is  the  order  in  which  the  nodes  are  selected  to  be 
investigated  or  explored.  The  rest  of  this  section  describes  various  search  techniques:  the  greedy 
method,  uninformed  search,  backtracking,  branch  and  bound,  and  best  first  search.  While  this  is 
not  a  definitive  list  of  search  techniques,  most  other  techniques  which  provide  an  optimal  solution 
are  some  combination  of  the  techniques  described. 

2.5.1  Greedy  Method  In  some  greedy  algorithms,  enough  information  is  known  about  the 
problem  to  always  ensure  the  search  is  on  the  path  to  the  optimal  solution.  Since  at  each  stage  of 
the  search  the  “best”  node  is  selected,  only  the  minimum  number  of  nodes  are  expanded.  Other 
greedy  algorithms,  such  as  hill  climbing,  expand  the  best  node  at  each  level  but  retain  no  state 
information on parent or sibling nodes. This can result in an algorithm returning a local, not
absolute, minimum as a solution [Pearl, 1988: 35]. Examples of greedy algorithms include the
minimum spanning tree algorithm and Dijkstra’s algorithm to solve the shortest path problem
[Brassard and Bratley, 1988: 80-87]. These problems are not NP-complete since the algorithms
can solve them in polynomial time.

2.5.2 Uninformed Search Another name for uninformed search is the “brute force” method.
This technique expands every node without considering if it is on a solution path or not. If a
solution is found, the algorithm stops. There are two main variations in uninformed search: depth
first search (DFS) and breadth first search (BFS).

2.5.2.1 Depth First Search (DFS) Depth first search works by always generating a
child node from the most recently expanded node. This continues until a solution is found or the
next state to be generated is not feasible. Thus, priority is given to expanding nodes at deeper
levels of the search graph. See Figure 2.3 for the following example. The search starts at the root
node, R, and continues down the path from node 1 to node 2 ending at node 4. At each level only
one node is expanded before going on to the next level. If the solution is not on the path expanded,
no solution is found [Pearl, 1985: 36].

Figure 2.3. Depth First Search (search graph drawn by level, levels 0 through 4)

A variation of DFS proposed by Korf is depth first iterative deepening (DFID). This algorithm
begins at the root node, level 0, and performs a DFS to level one, expanding all nodes to this level.
If a goal node was not found, it discards all nodes generated, starts over at level 0, and performs a
DFS to level two. It continues discarding nodes and performing depth first searches until a solution is
found. One disadvantage of this algorithm is that it performs wasted computations by discarding
and generating the same nodes repeatedly. Korf claims that for large problems, the number of
nodes expanded asymptotically approaches the number of nodes for regular DFS. He states that
since almost all work is performed at the deepest level of the graph, most nodes are only expanded
a few times. However, Korf assumes the cost of generating a node is cheap while the cost of storage
is high. This is not always the case. One advantage of DFID is that since every node at each level is
expanded, it finds the shortest path solution. The other advantage is that only small amounts of
memory are required since only the path to each node and the cost to reach the node are stored
[Korf, 1985: 98-106].
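
The DFID procedure can be sketched as a depth-limited DFS that is restarted with a larger limit each time. The sketch is a minimal illustration, not Korf's code: the tree is an assumed dictionary mapping each node to its children, the names depth_limited and dfid are hypothetical, and the first iteration uses a limit of zero (the root alone).

```python
def depth_limited(tree, node, goal, limit, path):
    """DFS from node that never goes more than limit levels deeper."""
    path = path + [node]
    if node == goal:
        return path
    if limit == 0:
        return None
    for child in tree.get(node, []):
        found = depth_limited(tree, child, goal, limit - 1, path)
        if found:
            return found
    return None


def dfid(tree, root, goal, max_depth=50):
    """Repeat the depth-limited search, discarding all nodes between passes."""
    for limit in range(max_depth + 1):
        found = depth_limited(tree, root, goal, limit, [])
        if found:
            return found
    return None


# Hypothetical tree: the goal "6" is two levels down the second branch.
TREE = {"R": ["1", "5"], "1": ["2"], "2": ["3"], "3": ["4"], "5": ["6"]}
path = dfid(TREE, "R", "6")   # ["R", "5", "6"]
```

A plain DFS on this tree would first follow the branch R, 1, 2, 3, 4 to its end; the iterative version reaches the shallow goal on the pass with limit 2 and stores only the current path.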

2.5.2.2 Breadth First Search (BFS) In contrast to DFS, BFS assigns a higher priority
to expanding nodes at a higher level in the search graph. The list of nodes to be expanded can be
stored in a first-in-first-out (FIFO) queue. See Figure 2.4 for the following example. The search
starts at node R at level 0. At level 1, nodes 1, 2, and 3 are expanded before going on to level 2.
Since it expands every node at each level before continuing down to another level of the graph, the
first solution path found by BFS is the one with the shortest path [Pearl, 1985: 42].
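
A minimal sketch of BFS with a FIFO queue follows. The graph is an assumed dictionary of child lists with no cycles, and the node names are hypothetical (they are not the nodes of Figure 2.4). The goal "7" can be reached in two steps through node 3 or in three steps through nodes 2 and 6.

```python
from collections import deque


def bfs(graph, root, goal):
    """Expand nodes level by level; return the first path that reaches goal."""
    queue = deque([[root]])            # FIFO queue of paths
    while queue:
        path = queue.popleft()         # oldest entry, i.e. the shallowest path
        node = path[-1]
        if node == goal:
            return path
        for child in graph.get(node, []):
            queue.append(path + [child])
    return None


GRAPH = {"R": ["1", "2", "3"], "1": ["4", "5"], "2": ["6"], "3": ["7"], "6": ["7"]}
path = bfs(GRAPH, "R", "7")            # ["R", "3", "7"], the shorter route
```

The queue holds a whole path for every open node, which shows directly why the memory required by BFS grows with the width of each level.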

2.5.2.3 Backtracking In the DFS, once the deepest node on the path was reached, the
search ended even if no solution was found. To continue searching for another, possibly better,
solution the algorithm must “backtrack” up the path to a higher level. At each higher level,
the node is checked for any unexpanded child nodes. If a child node is found, a new DFS is started
down that path. If no unexpanded child node is found, the algorithm backtracks to the next
higher level. This continues until all paths have been searched. In this case, the list of nodes to
be expanded can be stored in a last-in-first-out (LIFO) queue. In Figure 2.3, if node 4 was not
the solution, the algorithm backtracks to node 3 and checks to see if it has any unexpanded child
nodes. In this example, node 5 would be expanded next followed by the nodes in increasing order.
Nodes at a deeper level of the search graph are still given a higher priority for expansion than nodes
at a higher level [Pearl, 1985: 36-41].
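
Backtracking with a LIFO list can be sketched as follows. The tree is a hypothetical stand-in for Figure 2.3 (the figure itself is not reproduced here), chosen so that node 4 is the deepest node on the first path and node 5 is its unexpanded sibling. The function name and the tree layout are assumptions for illustration.

```python
def dfs_backtracking(tree, root, goal):
    """DFS with a LIFO list; returns the solution path and the visit order."""
    stack = [[root]]                   # LIFO list of paths waiting to be explored
    order = []
    while stack:
        path = stack.pop()             # most recently generated path first
        node = path[-1]
        order.append(node)
        if node == goal:
            return path, order
        # Push children in reverse so the leftmost child is explored first.
        for child in reversed(tree.get(node, [])):
            stack.append(path + [child])
    return None, order


TREE = {"R": ["1", "7"], "1": ["2", "6"], "2": ["3"], "3": ["4", "5"]}
path, order = dfs_backtracking(TREE, "R", "5")
# order is R, 1, 2, 3, 4, 5: after node 4 fails, the search backs up to node 3
```

Popping the stack is the backtracking step: when node 4 has no children, the next entry on the stack is the sibling generated from node 3.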

2.5.2.4 Branch and Bound To keep from having to explicitly explore every path of
the search graph, additional information must be stored. For example, if a maximum cost is known
after the first branch of the tree is explored, this cost can be stored and used to limit the search
down any other branch. If a lower maximum cost is found in another branch, this new value is
stored as the new maximum cost. Also, if additional information is known about the particular
problem being solved, a heuristic function can be used to calculate the cost of continuing to search
down a particular branch of the tree. An example of this function is the cost of reaching a node
plus a conservative estimate of the cost to the solution. If this value is greater than the known
maximum cost, or upper bound, then the search would not continue down this branch. Instead the
algorithm would jump, or branch, to the next node at the head of the stack or queue waiting to be
explored. In this way, all of the nodes of the tree do not have to be explicitly explored.

Bosworth and others divide a branch and bound algorithm into the following four main
parts:

1. Expansion procedure - A method to create a node’s children.

2. Selection procedure - A heuristic such as DFS or BFS to decide the order in which the nodes
are expanded.

3. Bounding rule - Does the cost to reach a particular node plus the estimated cost to completion
for that node equal or exceed the global best cost?

4. Termination rule - Determination of whether the node represents a solution. When searching
for an optimal solution, termination is delayed until all nodes have been evaluated either
explicitly or implicitly.

[Pennington and others, 1988: 241]


Using a combination of backtracking, branch and bound, and DFS, an optimal solution to
the search graph can be found quicker than using just DFS. An advantage of this method is that it
requires less memory storage than other search techniques since only the paths with solutions are
stored. A disadvantage of this method is that the algorithm can take longer than BFS. For example, in
Figure 2.3 the search works from top to bottom then left to right. If the solution was in the path
containing node 12, most of the nodes expanded were not on the solution path. However, if an
optimal solution is required, all branches of the graph would have to be searched, or bounded, to
validate that the best solution was found [Korf, 1985: 99].
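
The combination of DFS, backtracking, and branch and bound can be sketched in a few lines. The example is an assumption-laden illustration: the weighted tree, the goal test, and the default estimate h(n) = 0 are hypothetical, and the bounding rule prunes any node whose cost plus estimate equals or exceeds the best solution cost found so far.

```python
def dfs_branch_and_bound(children, is_goal, root, h=lambda n: 0):
    """Return (best cost, best path, number of nodes expanded)."""
    best_cost, best_path = float("inf"), None
    expanded = 0
    stack = [(0, [root])]                    # (cost so far, path)
    while stack:
        g, path = stack.pop()
        node = path[-1]
        if g + h(node) >= best_cost:         # bounding rule: prune this branch
            continue
        if is_goal(node):                    # better solution: tighten the bound
            best_cost, best_path = g, path
            continue
        expanded += 1
        for child, step in reversed(children.get(node, [])):
            stack.append((g + step, path + [child]))
    return best_cost, best_path, expanded


# Hypothetical tree; nodes whose names start with "G" are solutions.
CHILDREN = {
    "R": [("A", 4), ("B", 1)],
    "A": [("G1", 3), ("C", 5)],
    "B": [("G2", 2), ("D", 1)],
    "C": [("G3", 1)],
    "D": [("G4", 9)],
}
result = dfs_branch_and_bound(CHILDREN, lambda n: n.startswith("G"), "R")
```

In this run the first solution found (cost 7 through node A) is not optimal, but it allows node C to be discarded without being expanded. The search still has to visit or bound every branch before the cost 3 solution through node B is confirmed as the best.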

In contrast, using branch and bound with BFS provides a solution quicker since it finds the
solution with the shortest path. The time complexity is also at least [...]. However, BFS
requires more memory storage because all nodes at each level are expanded before going to the
next level. Since this requires all paths of the search graph to be stored until a solution is found,
the memory requirements could be the entire search space. For NP-complete problems, this is at
least [...], where r is a constant and f(n) is a function of the number of inputs into the problem.
When searching for an optimal solution, BFS continues searching the graph until all branches have
been explored. Only when the cost of continuing down a path exceeds the current maximum value
is a path removed from memory. Korf points out that many times the memory required to be
stored exceeds the memory of the computer. When this occurs, the problem is not solvable with
these techniques [Korf, 1985: 100-102].

2.5.3 Best First Search Like the DFS with branch and bound, best first search uses a
heuristic to calculate the estimated cost to find a solution down a certain path. Unlike DFS, which
expands the best node from the most recently expanded node, best first search expands the best
node in the entire graph. Using the heuristic information, best first search focuses the search down
the path which provides the best chance of producing a solution.

Figure 2.5. Hierarchical Diag

2.5.3.1 Hierarchy of Algorithms Pearl describes a hierarchy of best first algorithms
based on when the algorithm is terminated and how the cost of a node is calculated. This hierarchy
is shown in Figure 2.5. In Figure 2.5, d.t. stands for delayed termination, * signifies an optimal
solution is found if one exists, and r.w.c. stands for recursive weight function. Pearl defines a
recursive weight, or cost, function as follows:

A weight function W_G(n) is recursive if for every node n in the graph

W_G(n) = F[E(n); W_G(n_1), W_G(n_2), ..., W_G(n_b)]

where n_1, n_2, ..., n_b are the immediate successors of n, E(n) stands for a set of local
properties characterizing the node n, and F is an arbitrary combination function, monotonic
in its W_G(.) arguments.

Basically, if the weight of a node is recursive, the weight is a function of the weights of the
nodes in its path.

An optimal solution can be found using best first search by combining it into an algorithm
with branch and bound and delaying termination of the algorithm until all branches of the graph
have been evaluated.

2.5.3.2 A* One way to calculate the estimated cost to a solution is an
additive function of the form

f(n) = g(n) + h(n)

where g(n) is the cost of the path from the root node to node n
and h(n) is the estimated cost from node n to the solution [Pearl, 1985: 75]. As Figure 2.5 shows,
if this heuristic function is used with a best first algorithm and termination is delayed to search for
an optimal solution, the algorithm is called additive optimal, or A*.

Pearl defines a heuristic as admissible if

h(n) ≤ h*(n)

where h(n) is the estimated cost to completion and h*(n) is the actual cost to completion [Pearl,
1985: 77]. If an admissible heuristic is used in the A* algorithm, it is guaranteed to always find
an optimal solution if one exists [Korf, 1985: 103].
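
A minimal serial sketch of A* using f(n) = g(n) + h(n) is given below. The graph, the edge costs, and the heuristic table H are hypothetical values chosen so that h(n) never exceeds the true remaining cost; the function name a_star is also an assumption. The open list is a heap ordered by f, so the node with the lowest estimated total cost is always expanded next.

```python
import heapq


def a_star(graph, h, start, goal):
    """Return (cost, path) of a cheapest path, or None if goal is unreachable."""
    open_list = [(h[start], 0, start, [start])]    # entries are (f, g, node, path)
    best_g = {start: 0}                            # cheapest known cost to each node
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue                               # stale entry, a cheaper one exists
        for nxt, step in graph.get(node, []):
            g2 = g + step
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(open_list, (g2 + h[nxt], g2, nxt, path + [nxt]))
    return None


GRAPH = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 2)]}
H = {"S": 4, "A": 3, "B": 2, "G": 0}               # admissible estimates
result = a_star(GRAPH, H, "S", "G")                # (5, ["S", "A", "B", "G"])
```

The direct edge from A to G gives a solution of cost 7, but it stays on the open list with f = 7 while the path through B, with f = 5, is expanded first.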

2.5.3.3 A* Variations Korf suggests using a variation of the depth first iterative deepening
algorithm with the A* algorithm, called IDA*. At each iteration, perform a DFS, bounding
the path when the f(n) value exceeds a given threshold. The initial threshold is the estimated
completion cost of the root node. The threshold used for the next iteration is the minimum cost of
all values that exceed the current threshold. The algorithm ends when all nodes have been explored
or unexplored nodes exceed the cost of the threshold [Korf, 1985: 103]. Another variation of IDA*
is to set the threshold by the number of levels expanded. For example, on the first iteration expand
all nodes to level 3. Then on the second iteration, begin at level 0 and expand all nodes to level 6,
and then the third iteration would begin at level 0 and expand all nodes to level 9.

For example, Figure 2.6 shows the partial search space for a problem using cost as the threshold
to determine which nodes are expanded. The root node generates all of its children with
estimated cost less than or equal to the estimated cost of root node. On the first iteration, all
nodes at level 1 are generated and the minimum cost is now 110. On the second iteration, level 1
is generated again, but only node 1 is expanded. Node 2 is also expanded since its estimated cost
is still equal to 110. The new threshold value is now 111 from node 6. Each successive iteration
generates level 1, but only expands nodes with cost not exceeding the threshold value. The third
iteration expands nodes R, 1, 2, 6; generates nodes 3, 5, 7, 8, 10, 11; and the new threshold is 112. The
fourth iteration expands nodes R, 1, 2, 3, 5, 6, 7, 8, and generates nodes 4, 9, 11, 12, and 13. This continues
until a solution node is found, which becomes the new threshold, and the process continues until
all branches have been investigated.

Figure 2.6. Iterative Deepening A*

Korf claims most of the work by IDA* generating nodes is performed at the bottom of the
tree so the number of nodes generated asymptotically approaches the number of nodes generated
by A*. Since only the path from the root node to the solution node must be stored, he claims to
get A* speed with depth first search memory requirements [Korf, 1985: 106].
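
The cost-threshold form of IDA* can be sketched as repeated depth first searches bounded by f(n). This is an illustration under assumptions, not Korf's code: the graph and heuristic table are hypothetical, the heuristic is assumed admissible, and the sketch stops at the first solution found within the threshold rather than continuing to bound the remaining branches.

```python
def ida_star(graph, h, start, goal):
    """Return (cost, path, final threshold), or None if no solution exists."""
    threshold = h[start]                    # initial threshold: estimate at the root
    while True:
        next_threshold = float("inf")
        stack = [(0, [start])]              # only paths are stored, as in DFS
        while stack:
            g, path = stack.pop()
            node = path[-1]
            f = g + h[node]
            if f > threshold:               # bound this path, remember its cost
                next_threshold = min(next_threshold, f)
                continue
            if node == goal:
                return g, path, threshold
            for nxt, step in reversed(graph.get(node, [])):
                if nxt not in path:         # avoid cycling back along the path
                    stack.append((g + step, path + [nxt]))
        if next_threshold == float("inf"):
            return None                     # nothing left beyond the threshold
        threshold = next_threshold          # smallest value that exceeded it


GRAPH = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 2)]}
H = {"S": 4, "A": 3, "B": 2, "G": 0}
result = ida_star(GRAPH, H, "S", "G")       # (5, ["S", "A", "B", "G"], 5)
```

The first pass uses threshold 4 and finds nothing; the smallest f value that exceeded it is 5, which becomes the threshold for the second pass. Nodes S and A are generated again on the second pass, which is the repeated work discussed above.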

Cvetanovic  and  Nofsinger  suggest  another  A*  algorithm  using  what  they  term  Continuous 
Diffusion.  This  algorithm  is  used  on  distributed  memory  parallel  computers  using  a  distributed  list 
algorithm.  This  algorithm  performs  a  parallel  A*  search,  but  after  expanding  a  set  number  of  nodes, 
processors  then  exchange  a  certain  number  of  nodes  from  their  list  to  be  expanded  with  their  nearest 
neighbors.  This  keeps  processors  from  expanding  nodes  with  higher  costs  while  a  neighbor  has  nodes 
with much lower cost. The nodes with the lowest cost diffuse from processor to processor, ensuring
the best nodes are being expanded. The idea is to keep the local distributed list implementation
as close to a centralized list implementation as possible. Using this method, they claim to expand
a much smaller number of nodes than IDA*. See the section on Parallel Architecture for an
explanation of distributed memory and nearest neighbor [Cvetanovic and Nofsinger, 1990: 87].
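
The diffusion idea can be illustrated with a loose sketch of one exchange step. It is not the algorithm of Cvetanovic and Nofsinger: the exchange policy (each processor sends its single best node to each hypercube neighbor), the function names, and the node costs are all assumptions made for illustration. Neighbors in a hypercube are the processors whose numbers differ in exactly one bit.

```python
import heapq


def neighbors(rank, dim):
    """Processors whose number differs from rank in exactly one bit."""
    return [rank ^ (1 << bit) for bit in range(dim)]


def diffuse(open_lists, dim, share=1):
    """One exchange step: every processor sends its best nodes to each neighbor."""
    outgoing = []
    for rank, heap in enumerate(open_lists):
        for nb in neighbors(rank, dim):
            for _ in range(min(share, len(heap))):
                outgoing.append((nb, heapq.heappop(heap)))
    # Deliver only after every processor has chosen what to send, so a node
    # received in this step is not forwarded again in the same step.
    for nb, node in outgoing:
        heapq.heappush(open_lists[nb], node)


# Four processors (a 2-dimensional hypercube); entries are node costs.
open_lists = [[1, 2, 3, 9], [40, 41, 42], [50, 51, 52], [60, 61, 62]]
diffuse(open_lists, dim=2)
best_per_processor = [min(q) for q in open_lists]   # [3, 1, 2, 41]
```

Before the step only processor 0 holds low-cost nodes; afterwards processors 1 and 2 each hold one, and processor 3, which is not adjacent to processor 0, would receive them on a later step.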

2.6 Summary of Search Algorithms

This section is a summary of the search algorithms presented. Table 2.1 is only a generalized
description of the algorithms. Specific problems and implementations can greatly affect the memory
and time utilization of the algorithms. For example, all parallel versions of A* are greatly affected
by whether the list of work to be done is maintained on a centralized or distributed list.

Table 2.2 is a summary of some important papers about search algorithms and NP-complete
problems.

Parallel Search Algorithms

ALGORITHM | MEMORY | TIME | COMMENTS
----------|--------|------|---------
Depth First | Requires little memory. Best feature. | Varies greatly depending on where in search graph a solution is located. Can require prohibitive amount of time. | Branch and bound and backtracking can greatly reduce time.
Breadth First | Can require prohibitive amounts of memory. | Varies greatly depending on where in search graph a solution is located. | Branch and bound and backtracking can greatly reduce time and memory required.
Best First | Can require prohibitive amounts of memory. Uses less than Breadth First. | Solutions are consistently quickest, but can be longer than Depth First. | Characteristics depend on the variation used.
A* | Same as Best First | Same as Best First |
IDA* | Memory the same as Depth First | Claimed to be same as A* | Still undecided issues on relative speed, especially in parallel version.
Continuous Diffusion A* | Same as Best First | Claimed to be better than IDA*, definitely better than A* | Parallel version only. Dispute about relative merit compared to IDA*.

Table 2.1. Memory and Time Comparisons of Search Algorithms

Parallel Search Algorithms

Year | Investigators | Description
-----|---------------|------------
1985 | Korf | Theoretical investigation of Depth First Iterative Deepening
1988 | Pennington, Bosworth, Wheeler, Stiles and Raghuram | Branch and Bound Algorithms for Distributed Database Networks
 | Jansen and Sijstermans | Parallel Branch and Bound Algorithms
 | | Overview of hypercube architectures with 2 examples of use
 | | Parallel Traveling Salesman Program and factors which affect speedup
1990 | Li and Wah | Good discussion of anomalies in parallel search algorithms
1990 | Quinn | Theoretical and measured evaluation of different load balancing algorithms for the hypercube
1990 | Cvetanovic and Nofsinger | Different hypercube load balancing using Continuous Diffusion

Table 2.2. Applications and Implementations of Search Algorithms

SUMMARY

Algorithms to solve NP-complete problems on serial computers are well known. However, NP-complete
algorithms implemented on parallel computers have been studied only in the last decade, and many
fundamental questions remain unanswered. New techniques for load balancing and communicating
global variables are among the main areas of research. Study is also underway on how parallel
algorithms work and ways to increase the speedup.

III. Methodology and Design


3.1  Introduction 

This  chapter  discusses  the  methodology  used  in  this  research  and  the  preliminary  design  of  a 
parallel  A*  algorithm.  The  methodology  is  described  in  section  3.2  and  the  preliminary  design  in 
section  3.3.  Complexity  analysis  of  the  design  is  provided  in  section  3.4. 

3.2  Methodology 

Designing and implementing a complicated sequential algorithm can be extremely difficult.
The additional complexities of parallel algorithms discussed in Chapter II accent the requirement
for a systematic approach to designing parallel algorithms. For all algorithms used in this research,
the first step was to develop a thorough understanding of the problem. Chapter II provides the
background for understanding NP-complete problems, search algorithms, and computer architecture.

Next, preliminary and detailed algorithms were designed using a top-down approach. Each
function and data structure was designed to allow modification or incorporation into other algorithms.
Many of the functions and algorithms are designed to be run sequentially on parallel
processors. This allows use of a personal computer using Borland C++ for the initial implementation
and debugging.

Then the sequential algorithms and functions were combined and implemented on the Intel
iPSC/2 hypercube and tested to validate proper execution. Many test cases and examples were
used to attempt to test all functions. Some functions were of a size or importance to be tested
completely. For example, the operations on the queue which stores the work to be performed are
critical to the proper operation of the algorithm and could be exhaustively tested.

Finally, data was collected and analyzed to evaluate the performance of the algorithm. Performance
metrics discussed in the next section were the main measures of performance. The data
was evaluated to understand why changes to the algorithms produced these results, how
the different algorithms compare, and which algorithm would be better for different problems.


3.3  Metrics 

3.3.1 Speedup Many metrics can be used to measure the effectiveness of a parallel algorithm.
Chapter II gave the definition of speedup and efficiency of parallel algorithms as:

S = T_serial / T_parallel

and

E = S / P

These are good metrics when trying to compare the total time to run different algorithms on the
same type of computer. However, it can be difficult comparing run times from different types of
computers because of different clock rates, communication schemes, memory schemes, and many
other factors. Therefore, small variations in run times on different computers are not important.

Normally,  the  best  speedup  achieved  is  linear.  For  example,  if  the  time  to  run  the  sequential 
program  is  100  seconds  and  the  parallel  program  time  is  50  seconds  using  2  processors,  the  speedup 
is 2. Ideally, if 4 processors are used, the parallel time would decrease to 25 seconds for a speedup
of  4.  Thus,  in  this  example  the  speedup  is  linearly  proportional  to  the  number  of  processors  used. 
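
The arithmetic in this example can be checked with a few lines of code. The function names are chosen here for illustration, and the 40 second run on 4 processors is a made-up additional case showing efficiency below 1.

```python
def speedup(t_serial, t_parallel):
    """S = T_serial / T_parallel."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """E = S / P."""
    return speedup(t_serial, t_parallel) / processors


s2 = speedup(100, 50)              # 2.0 with 2 processors
s4 = speedup(100, 25)              # 4.0 with 4 processors (linear speedup)
e4 = efficiency(100, 25, 4)        # 1.0, every processor fully used
e4_slow = efficiency(100, 40, 4)   # 0.625 for a hypothetical 40 second run
```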

However, Miller and Penke describe many limitations on the achievable speedup. First, the
startup and termination of all algorithms are by nature sequential and cannot be parallelized.
Another limitation on the speedup is the extra work performed by the parallel search algorithm.
As described in the next section, parallel search algorithms perform extra work as compared to the
sequential algorithm. Also, there are time costs associated with communication and/or memory
contentions in parallel computers not found on sequential computers. All of these problems decrease
the amount of speedup achieved [Miller and Penke, 1989: 133].
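
The effect of the sequential startup and termination alone can be bounded with a small calculation. The sketch uses the standard fixed-fraction bound (commonly known as Amdahl's law), which is not named in the text above; the function name and the 10 percent sequential fraction are assumptions for illustration. The bound ignores communication and extra search work, so measured speedup would be lower still.

```python
def max_speedup(serial_fraction, processors):
    """Upper bound on speedup when a fixed fraction of the run is sequential."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


ideal = max_speedup(0.0, 8)       # no sequential part: linear speedup of 8
limited = max_speedup(0.1, 16)    # 10 percent sequential: about 6.4 on 16 processors
ceiling = 1.0 / 0.1               # limit as the number of processors grows
```

Under this assumption no number of processors can give a speedup above 10.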

Sometimes the speedup is greater than linear. This is normally considered an anomaly and not
true speedup. Li and Wah provide the following reasons or conditions which can result in super-linear
speedup:

1.  There  are  multiple  solution  nodes.  This  can  allow  the  parallel  search  algorithm  to  find  a 
solution  before  the  sequential  algorithm. 

2.  The  heuristic  function  is  ambiguous  and  allows  for  selection  of  more  than  one  path. 

3.  The  rule  used  to  eliminate  nodes  isn’t  consistent  with  the  heuristic  function. 

4. The tree structure of the search space causes nodes not expanded in the sequential algorithm
to be expanded when using multiple processors.

5. The feasible solutions are not generated in the same order when different numbers of processors
are used.

As they point out, different combinations of these conditions cause the tree to be searched in different
orders depending on the number of processors used [Li and Wah, 1990: 21-29]. If the parallel
algorithm has super-linear speedup only on particular data sets, then these cases are probably
anomalies. However, if the parallel algorithm consistently has super-linear speedup over all data
sets, the sequential algorithm is not designed well and can be improved.

3.3.2 Nodes Expanded Another metric which is less dependent on the type of computer used
is the number of nodes expanded by the algorithm. Using Figure 2.3 as an example, if node 4 was
a solution (goal) node, then the minimum number of nodes would be expanded. However, if node
10 was the goal node, then expansion of all nodes not in its path was wasted work. Obviously, the
more efficient algorithms expand fewer nodes not on the solution path.

Figure 3.1 represents the total search space for a given problem with the numbers representing
locations of optimal solutions.


Figure  3.2  shows  the  portion  of  the  space  searched  by  DFS  to  find  solution  1.  If  3  was  the 
only  solution,  practically  the  entire  space  would  be  searched  even  using  a  branch  and  bound  DFS 
algorithm.  If  a  “near”  optimal  solution  was  found  early  in  the  search  in  Figure  3.2,  then  most  of 
the  search  space  could  be  implicitly  checked  without  having  to  expand  the  nodes.  This  could  save 
time,  but  there  is  no  way  to  guarantee  a  near  optimal  solution  will  be  found  early  in  the  search. 


Figure 3.2. Search Space For Depth First Search (panel title: Depth First Search Using Branch and Bound)


Figure 3.3 shows the search space searched to find a solution using BFS. Notice solution
2  is  found  first  with  very  little  of  the  search  space  explored.  However,  nearly  the  entire  search 
space  must  be  explored  if  3  is  the  only  solution.  Even  with  backtracking  and  branch  and  bound 
algorithms,  the  program  might  take  too  long  to  run  in  both  of  these  cases. 

A better way to explore the search space is shown in Figure 3.4. No matter where a solution
is located in the space, the search concentrates on the path to it. While this seems obvious, the
hard part is designing an algorithm which explores the search space in this manner. As seen from
these examples, the number of nodes expanded can be a “good” metric to determine the efficiency
of the search algorithm.

However, a problem can arise when using the number of nodes as the only metric. Miller
and Penke state a sequential version of a search algorithm normally expands fewer nodes than a
parallel version. This is because many nodes in the sequential version are not evaluated because
their costs exceed the global best cost. On a parallel search, the higher levels of the search space
can have many nodes which are less than the global best cost. Some of these nodes may be
expanded needlessly before the global best cost is reduced. Also, the sequential version always
has the complete list of open nodes, or nodes waiting to be expanded. Therefore, the sequential
algorithm can always choose the best node to expand next. In the parallel A*
algorithm, either the complete open list is maintained on a central processor, i.e., a centralized list, or
partial lists are kept on each individual processor. This means that some processors are expanding
nodes which are not the global best [Miller and Penke, 1989: 133]. Therefore, while the number of
nodes expanded is less for the sequential version, the total time can be much greater. This is one
reason why more than one metric should be used to evaluate an algorithm.


Figure 3.3. Search Space For Breadth First Search
3.4  Understanding the Problem


3.4.1  Traveling Salesman Problem  The A* algorithm is a method used to determine the
solution to problems requiring a search of all possible solutions. To study its characteristics and
performance on a distributed memory parallel computer, a family of problems is required to be
solved using the A* algorithm. For this research, I chose the traveling salesman problem (TSP).
One reason the TSP was selected to use with the A* algorithm is that it is one of the
most widely studied families of NP-complete problems. Since it is so widely studied, the sequential
implementation of TSP is well understood and there are many good sequential algorithms already
developed to compare the parallel algorithm against. Also, since any NP-complete problem can be
mapped to any other NP-complete problem in polynomial time, all NP-complete problems could
be solved using the TSP.

The TSP consists of a graph of cities and the associated costs to travel between cities. In the
TSP graph, the cities are represented by the vertices of the graph and the distances between cities
by the edges of the graph. If every city has a direct path to every other city, the graph is completely
connected. If the graph is traversed and every vertex is visited exactly once and the beginning and
final vertices are the same, then the traversal is called a tour [Christofides, 1975: 6-9]. The goal
of the TSP is to begin at an arbitrary city and complete a tour of the cities traveling the shortest
possible distance [Brassard and Brantley, 1988: 103]. The cost of the tour is the sum of the costs
of the edges of the tour.
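The tour definition and its cost can be illustrated with a minimal C sketch. The function names `tour_cost` and `is_tour`, the limit `MAX_CITIES`, and the row-major flat layout of the cost matrix are assumptions made for this illustration; they are not taken from the programs described in this thesis.

```c
#include <assert.h>

#define MAX_CITIES 32   /* hypothetical upper bound on problem size */

/* Sum the edge costs of the closed tour
 * tour[0] -> tour[1] -> ... -> tour[n-1] -> tour[0].
 * cost is an n x n matrix stored row by row. */
int tour_cost(int n, const int *cost, const int *tour)
{
    int total = 0;
    for (int k = 0; k < n; k++) {
        int from = tour[k];
        int to = tour[(k + 1) % n];   /* wraps back to the first city */
        total += cost[from * n + to];
    }
    return total;
}

/* Return 1 if every city 0..n-1 appears exactly once in tour, else 0. */
int is_tour(int n, const int *tour)
{
    int seen[MAX_CITIES] = {0};
    if (n < 1 || n > MAX_CITIES)
        return 0;
    for (int k = 0; k < n; k++) {
        int c = tour[k];
        if (c < 0 || c >= n || seen[c])
            return 0;
        seen[c] = 1;
    }
    return 1;
}
```

Because the edge from the last city back to the first is included, the same routine works for asymmetric cost matrices, where the direction of travel matters.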

If the cost of traveling between cities i and j is stored in a cost matrix at location c_{ij}, then the
TSP can be stated mathematically as:

Minimize

\[ \sum_{i \in V} \sum_{j \in V} c_{ij} x_{ij} \tag{3.1} \]

subject to

\[ \sum_{j \in V} x_{ij} = \sum_{i \in V} x_{ij} = 1, \quad i, j \in V, \tag{3.2} \]

\[ \sum_{i \in S} \sum_{j \in V - S} x_{ij} \ge 1 \quad \text{for all } S \subset V, \ S \ne \emptyset, \tag{3.3} \]

\[ x_{ij} = 0 \text{ or } 1, \quad i, j \in V, \tag{3.4} \]

where x_{ij} = 1 if edge <i, j> is in the solution and 0 otherwise. Equation 3.2 ensures that each
city is entered and left exactly once, Equation 3.3 eliminates the possibility of subtours, and
Equation 3.4 restricts each x_{ij} to 0 or 1 [Rottman, 1990: 97-98].

To  solve  the  TSP,  all  possible  combinations  or  paths  between  cities  must  be  checked  to  find 
the  optimal  solution.  This  is  accomplished  by  starting  at  an  arbitrary  city  and  adding  cities  one 
at  a  time  to  the  list  of  cities  visited.  After  each  city  is  added,  the  list  is  checked  to  see  if  the  cities 
are  a  tour  and  an  admissible  heuristic  function  is  used  to  determine  the  cost  of  continuing  down 
this  path  to  completion.  If  the  cities  are  a  tour,  their  cost  is  compared  to  the  best  cost  found  so 
far  and  the  smaller  value  is  retained  as  the  new  best  cost.  If  the  cities  are  not  a  tour,  then  the 
estimated  cost  returned  by  the  heuristic  function  is  compared  to  the  best  cost.  If  the  estimated 
cost  is  greater  than  the  best  cost,  then  this  path  is  removed  from  the  list  of  solution  paths  to  be 
explored since its cost is higher than a solution already found.
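The bounding decision described above can be sketched in C as a single function. The enumeration `path_action` and the function `bound_check` are hypothetical names introduced for this illustration; the sketch follows the wording of the paragraph (a partial path is pruned only when its estimate is greater than the best cost) and is not the code used in the programs.

```c
#include <assert.h>

/* Possible outcomes for a path that has just had a city added. */
enum path_action { KEEP_AS_BEST, KEEP_FOR_EXPANSION, PRUNE };

/* estimated_cost is the actual cost when the path is a complete tour,
 * otherwise the heuristic estimate of completing the path.
 * best_cost is updated in place when a better tour is found. */
enum path_action bound_check(int estimated_cost, int is_complete_tour,
                             int *best_cost)
{
    if (is_complete_tour) {
        if (estimated_cost < *best_cost) {
            *best_cost = estimated_cost;   /* retain the smaller value */
            return KEEP_AS_BEST;
        }
        return PRUNE;                      /* no better than the best tour */
    }
    if (estimated_cost > *best_cost)
        return PRUNE;                      /* cannot beat a known solution */
    return KEEP_FOR_EXPANSION;
}
```

With an admissible heuristic the estimate never exceeds the true completion cost, so pruning on this test cannot discard the optimal tour.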

3.4.2  A*  The A* algorithm is a specialized form of the best first algorithm. As discussed
in Chapter II, using an admissible heuristic guarantees finding an optimal solution if one exists.
Also discussed in Chapter II was that

f(n) = g(n) + h(n)

where g(n) is the cost of the path from the root node to node n and h(n) is the estimated cost from
node n to the solution [Pearl, 1985: 75]. The sequential algorithm is described by Pearl as follows:

1.  Put  the  start  vertex  s  on  OPEN  list 

2.  If  OPEN  is  empty,  exit  with  no  solution  found 

3.  Remove from OPEN and place on CLOSED list a node n for which f is a minimum

4.  If  n  is  a  goal  node,  exit  with  the  solution  obtained  by  tracing  back  the  pointers  from  n  to  s 

5.  Otherwise expand n, generating all children and attach to them pointers back to n. For all
children n' of n:

(a)  If n' is not already on OPEN or CLOSED, estimate h(n') and calculate f(n').

(b)  If n' is already on OPEN or CLOSED, direct its pointers along the path yielding the
lowest g(n').

(c)  If n' required pointer adjustment and was found on CLOSED, place it on OPEN

6.  Go to step 2

[Pearl, 1985: 64-65].

3.5  Heuristic Estimate of h(n)

According to Felten, the selection of the proper heuristic function to estimate h(n) is critical.
A good heuristic allows the program to prune non-optimal branches of the search tree early in the
search [Felten, 1988: 1501]. Kumar states that if h(n) is very close to the actual cost, then most
nodes expanded will be on the path to the optimal solution, producing a very efficient algorithm
[Kumar, 1990: 44].

Two of the most widely used admissible heuristics to generate h(n) are the minimum spanning
tree and the assignment problem. Both are polynomial time algorithms, but according to Kumar,
the assignment problem is one of the best heuristics for use with the TSP [Kumar, 1988: 124].
Therefore the heuristic function used in these algorithms is the assignment problem. Christofides
defines the assignment problem as follows:


Given a number of resources and a number of requesters of those resources, and the
profit or usefulness of each resource to each requester in the form of a rating matrix
where element a_{ij} is the profit of assigning resource i to requester j, the problem is to
assign each resource to one and only one requester in a way such that a given measure
is optimized [Christofides, 1975: 287].


This definition implies the number of requesters and the number of resources are the same.
In that case, the solution can be found in polynomial time. Cases where the number of
requesters and resources are not equal can be handled by adding dummy resources or requesters to
make the matrix square, after which the same polynomial time methods apply. When used in the
TSP, the assignment problem has the same number of resources and requesters.

The assignment problem can be viewed as a matching of bipartite graphs. A bipartite graph
is defined by Christofides as:


a non-directed graph G = (X, A) is said to be bipartite if the set X of its vertices
can be partitioned into two subsets X_a and X_b so that all arcs have one terminal vertex
in X_a and the other in X_b. A directed graph is said to be bipartite if its non-directed
counterpart \bar{G} is bipartite [Christofides, 1975: 40].


Mathematically, the assignment problem can be stated as follows:

Given there are N requesters, W resources, and a cost matrix

\[ C = \| c_{ij} \|, \quad c_{ij} > 0 \ \text{ for } i = 1, 2, \ldots, N \text{ and } j = 1, 2, \ldots, W \tag{3.5} \]

find an assignment matrix

\[ X = \| x_{ij} \| \tag{3.6} \]

such that

\[ x_{ij} = \begin{cases} 1 & \text{if resource } i \text{ is assigned to requester } j \\ 0 & \text{otherwise} \end{cases} \tag{3.7} \]

subject to the constraints

\[ \sum_{i=1}^{N} \sum_{j=1}^{W} c_{ij} x_{ij} = \text{minimum}, \qquad \sum_{i=1}^{N} x_{ij} = \sum_{j=1}^{W} x_{ij} = 1 \tag{3.8} \]
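To make the formulation concrete, the following C sketch solves a small square assignment problem by trying every one-to-one assignment and keeping the cheapest. This exhaustive enumeration is only an illustration of what Equation 3.8 asks for; it is not the polynomial time method used in this research, and the names `assignment_min_cost`, `search_rows`, and `MAX_N` are assumptions introduced here.

```c
#include <assert.h>
#include <limits.h>

#define MAX_N 8                 /* hypothetical limit; enumeration is O(n!) */

static int best_total;          /* cheapest complete assignment found */
static int column_used[MAX_N];  /* 1 if the column is already assigned */

/* Assign rows row..n-1 to unused columns in every possible way. */
static void search_rows(int n, const int *cost, int row, int partial)
{
    if (row == n) {
        if (partial < best_total)
            best_total = partial;
        return;
    }
    for (int col = 0; col < n; col++) {
        if (column_used[col])
            continue;
        column_used[col] = 1;   /* x[row][col] = 1 */
        search_rows(n, cost, row + 1, partial + cost[row * n + col]);
        column_used[col] = 0;   /* undo and try the next column */
    }
}

/* Return the minimum total cost of a one-to-one assignment for an
 * n x n cost matrix stored row by row (1 <= n <= MAX_N). */
int assignment_min_cost(int n, const int *cost)
{
    best_total = INT_MAX;
    for (int j = 0; j < n; j++)
        column_used[j] = 0;
    search_rows(n, cost, 0, 0);
    return best_total;
}
```

The `column_used` array enforces the constraint that each row and each column contains exactly one selected element.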

3.5.1  Assignment Problem Example  As an example of the assignment problem, Figure 3.5
A is a 0/1 matrix representation of the requesters and the resources. A “1” in position (i, j) of the
matrix means resource j can be assigned to requester i and a “0” means it cannot be assigned.
Another matrix, Figure 3.5 B, is constructed having the costs of each resource being assigned to
each requester. After performing an element by element multiplication of the two matrices, the
final matrix showing the cost to perform each task is calculated and shown in Figure 3.5 C.

Figure 3.5. Assignment Problem Cost Matrix

3.5.2  Assignment Problem Algorithm  While there are many algorithms to solve the assignment
problem, one of the most widely used is the Hungarian Method developed by Kuhn. This
algorithm finds independent sets which have minimal (or maximal) costs. Bourgeois and Lassalle
define a set of elements of a matrix to be independent if none of the elements are in the same row
or column [Bourgeois and Lassalle, 1971: 14]. This restricts the allocation of one resource to one
requester and vice versa. The independent sets are found by subtracting the smallest element of a
row in the cost matrix from all other elements in the same row. Then the smallest element in each
column in the cost matrix is subtracted from each element in that column. This results in every row
and column having at least one null element for an N x N matrix. A row or column is covered when
it contains only one null element. The smallest element in the uncovered rows/columns is then
subtracted from all the uncovered elements and added to null elements of the covered rows/columns.
The process is repeated until all resources are covered by independent elements, in which case an
optimal solution has been found [Kuhn, 1955: 25].
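The row and column subtraction that opens the Hungarian Method can be sketched in C as follows. The function name `reduce_matrix` and the flat row-major matrix layout are assumptions for this illustration. The sketch covers only the reduction step, not the covering and marking steps. The returned total of the subtracted minima is a standard lower bound on the cost of the optimal assignment.

```c
#include <assert.h>

/* Subtract each row's minimum from that row, then each column's
 * minimum from that column, modifying the n x n matrix m in place.
 * Returns the sum of all the minima that were subtracted. */
int reduce_matrix(int n, int *m)
{
    int total = 0;

    for (int i = 0; i < n; i++) {           /* row reduction */
        int min = m[i * n];
        for (int j = 1; j < n; j++)
            if (m[i * n + j] < min)
                min = m[i * n + j];
        for (int j = 0; j < n; j++)
            m[i * n + j] -= min;
        total += min;
    }

    for (int j = 0; j < n; j++) {           /* column reduction */
        int min = m[j];
        for (int i = 1; i < n; i++)
            if (m[i * n + j] < min)
                min = m[i * n + j];
        for (int i = 0; i < n; i++)
            m[i * n + j] -= min;
        total += min;
    }
    return total;
}
```

After the call every row and every column of the matrix contains at least one null element, which is the starting condition for marking independent nulls.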

The  algorithm  derived  from  the  word  description  is: 

1.  Construct a cost matrix C_0, where each c_{ij} is the cost of the link in the bipartite graph
between vertices x_i ∈ X_a and x_j ∈ X_b.

2.  Subtract the minimum element in each row of C_0 from every element in that row.

3.  Subtract the minimum element in each column of C_0 from every element in that column.

4.  For every row in the matrix with only one null element, mark the null element and cross out
any other null element in that column.

5.  If all rows are covered, i.e., contain a marked null element, then this corresponds to an optimal
solution and exit the algorithm. If all rows are not covered, go to the next step.

6.  For every column in the matrix with only one null element, mark the null element and cross
out any other null element in that row.

7.  If all columns are covered, exit with an optimal solution. If all columns are not covered, go
to the next step.

8.  Mark any row which does not contain any marked null elements.

9.  Mark columns which have unmarked null elements in a marked row.

10.  Mark the rows that have a marked null element in a marked column.

11.  Repeat steps 9 and 10 until nothing else can be marked.

12.  Cover all unmarked rows and all marked columns by drawing a line through them.

13.  Find the minimum uncovered element and subtract it from all other uncovered elements and
add it to the elements that are covered by both a row and column cover, i.e., the intersection
of the lines.

14.  Repeat steps 2 through 13 until an optimal solution is found. The cost of the optimal
solution is found by summing the individual costs of the marked null elements in the original
cost matrix C_0.


Using  the  cost  matrix  in  Figure  3.6,  an  example  assignment  problem  is  solved.  Notice  in  step 
4  that  the  first  null  element  to  be  marked  was  in  the  last  row.  Crossing  out  the  two  other  nulls  in 
the  same  column  allowed  the  remaining  null  in  the  first  row  to  be  marked.  Now  the  other  null  in 
the second row could be crossed out. Step 8 in Figure 3.7 begins the process of generating other
possible  combinations  of  null  assignments.  Step  13  in  Figure  3.8  shows  the  new  cost  matrix  after 
performing  steps  8-13.  With  this  new  cost  matrix,  the  algorithm  is  started  again  at  step  2.  Figure 
3.9 step 5 shows the row and column null elements. The final solution is determined from assigning
the  null  elements  in  a  column  (RESOURCE),  to  the  row  it  is  in  (REQUESTER).  In  this  example 
the  solution  is: 
3.6  High Level Design


Design of an algorithm can be divided into three broad categories: high level design, low level
design, and implementation. High level design consists of the major steps required to perform the
task and doesn't consider architectural peculiarities of the machine. Low level design adds more
details to the algorithm and begins to customize it for a particular architecture. Implementation
provides a finished program capable of running on a computer.

Designing a parallel algorithm has all the difficulty of designing a sequential algorithm with
many other considerations besides. Parallel considerations include mutual exclusion of data,
control of the algorithm, timing between processors, load balancing, and decomposition techniques.
Only control of the data and decomposition techniques are discussed in this chapter. The other
considerations are discussed in Chapter IV during low level design.

3.6.1  Sequential  TSP  Algorithm 

3.6.1.1  Terms and Definitions  In all the algorithms discussed, there are some common
terms and definitions which are given now. Chapter IV also provides more details of the data
structures used and gives examples for some of the terms.

•  num_cities - The number of cities to be visited in the problem

•  NODE - A structure which has the fields vector, cost, and link. Vector contains the cities
which have been visited, link points to the next NODE in the OPEN queue and cost is the
cost of visiting the cities in vector

•  BEST - A structure of type NODE which has the fields vector, cost, and link and contains
the current best solution used to bound the search

•  OPEN - The open list kept in an array of NODEs as described above. References can be
made to any element of the queue and field of the NODE by using the element number and
the field name. For example, to compare the cost of the node on the front of the queue against
the current BEST.cost, the following is used:

if (OPEN[q_front].cost < BEST.cost)

•  free list - A subset of the OPEN array which links elements of the array available to store
NODEs

•  WORK_REQUEST - Label used in the algorithm to identify messages sent by Workers to
the Controller requesting a NODE from OPEN for expansion

•  node_status - Status of Worker, either busy or available for work

•  NEW_NODE - Label used in the algorithm to identify messages sending NODEs from the
Workers to the Controller for insertion in the OPEN list

•  DONE - Label used in the algorithm to identify the terminate message sent by the Controller
to the Workers

•  EXPAND_NODE - Label used in the algorithm to identify messages sending a NODE from
the OPEN list on the Controller to a Worker

3.6.1.2  TSP Algorithm  The basis for the parallel TSP search is the sequential
TSP algorithm. The sequential algorithm was modified from the one developed by 1Lt Mike
Rottman to solve the TSP on the iPSC/1 hypercube [Rottman, 1990: 97-l'J.'j]. This algorithm uses
the A* definition in Pearl [Pearl, 1988: ti l] and the definition of TSP. The assignment problem is
the function used to calculate the heuristic h(n). The sequential algorithm is:

Sequential TSP
    Build cost matrix
    Perform depth first search of one node to obtain initial BEST.cost
    Generate starting node in search tree
    Loop while (cost of NODE on front of OPEN < BEST.cost)
        Remove NODE from OPEN
        Loop until all cities have been checked
            Add a city to end of partial tour of NODE removed from OPEN
            If city has not been visited in this partial tour
                Calculate h(n) and f for new partial tour
                If (new NODE.cost < BEST.cost) and (cities visited is a tour)
                    BEST = NODE
                If (NODE.cost < BEST.cost) and (cities visited is not a tour)
                    Insert new NODE into OPEN
        END Loop until all cities have been checked
    END Loop while (cost of NODE on front of OPEN < BEST.cost)
    Calculate/Collect results
END Sequential TSP


This algorithm removes a NODE from the front of OPEN and adds one city to the NODE.vector
partial tour. It checks the resultant partial tour to see if the added city had already been visited
and if the resultant tour is a complete tour. If the city is already in NODE.vector, the NODE
is discarded. If the city is not in NODE.vector, the result is a complete tour, and its cost is lower
than the current best cost, then the node becomes the new BEST. If it is not a tour and the cost
is less than BEST.cost, the node is inserted into the OPEN queue for possible selection for
expansion. This continues until all possible cities have been added to the original NODE.vector
partial tour. The algorithm then removes the next NODE from OPEN and begins the cycle again.

This continues until OPEN is empty or the cost of the NODE at the front of OPEN is no longer
less than BEST.cost.

For example see Figure 3.10. The search begins by initially placing node 1 on OPEN. All
possible children of node 1 are generated with their associated estimated cost to completion, f, and
placed on OPEN. The NODE with the lowest f value is removed from OPEN and expanded. After
each node is expanded, the children are placed on OPEN and the cycle is repeated. Notice that
node 3 has the partial tour of 1 - 3 and each child of node 3 adds one city to that tour and generates
a new cost. Notice also that node 7 only generates two children because the other possible cities
have already been visited. The NODEs are kept in a priority queue so the best node is expanded
next. The sorted OPEN list is the key element to making this a best first strategy and the function
f = g(n) + h(n) makes it A*.
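The sequential search can be sketched in C for a very small problem. This is a minimal illustration under stated assumptions, not the program developed in this research: the heuristic `estimate_h` sums the cheapest outgoing edge of every city that must still be left, a simple admissible stand-in for the assignment problem; the initial BEST is the tour 0, 1, ..., N-1 rather than the result of a depth first search; and the NODE fields `len` and `g`, the names `solve_tsp`, `open_insert`, and `open_remove_front`, and the constants `N` and `OPEN_MAX` are all hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define N 4            /* number of cities (hypothetical) */
#define OPEN_MAX 256   /* capacity of the OPEN array (hypothetical) */

typedef struct {
    int vector[N + 1]; /* cities in the order visited */
    int len;           /* cities stored in vector (assumed extra field) */
    int g;             /* cost of the partial tour (assumed extra field) */
    int cost;          /* f = g + h */
} NODE;

static NODE open_list[OPEN_MAX];
static int open_len;

/* Every city still to be left costs at least its cheapest outgoing edge. */
static int estimate_h(const int *c, const NODE *nd)
{
    int visited[N] = {0}, h = 0;
    for (int k = 0; k < nd->len; k++)
        visited[nd->vector[k]] = 1;
    visited[nd->vector[nd->len - 1]] = 0;   /* last city must still be left */
    for (int i = 0; i < N; i++) {
        int min = INT_MAX;
        if (visited[i])
            continue;
        for (int j = 0; j < N; j++)
            if (j != i && c[i * N + j] < min)
                min = c[i * N + j];
        h += min;
    }
    return h;
}

/* Insertion sort keeps the lowest f at the front of OPEN. */
static void open_insert(const NODE *nd)
{
    int pos = open_len;
    assert(open_len < OPEN_MAX);
    while (pos > 0 && open_list[pos - 1].cost > nd->cost) {
        open_list[pos] = open_list[pos - 1];
        pos--;
    }
    open_list[pos] = *nd;
    open_len++;
}

static NODE open_remove_front(void)
{
    NODE front = open_list[0];
    for (int k = 1; k < open_len; k++)
        open_list[k - 1] = open_list[k];
    open_len--;
    return front;
}

/* Best first branch and bound from city 0; returns the best tour cost. */
int solve_tsp(const int *c, NODE *best)
{
    NODE start = { {0}, 1, 0, 0 };
    best->len = N;
    best->g = 0;
    for (int k = 0; k < N; k++) {           /* initial bound: 0,1,...,N-1 */
        best->vector[k] = k;
        best->g += c[k * N + (k + 1) % N];
    }
    best->vector[N] = 0;
    best->cost = best->g;
    open_len = 0;
    start.cost = estimate_h(c, &start);
    open_insert(&start);
    while (open_len > 0 && open_list[0].cost < best->cost) {
        NODE parent = open_remove_front();
        int last = parent.vector[parent.len - 1];
        for (int city = 1; city < N; city++) {
            int seen = 0;
            NODE child = parent;
            for (int k = 0; k < parent.len; k++)
                if (parent.vector[k] == city)
                    seen = 1;
            if (seen)
                continue;                   /* city already in partial tour */
            child.vector[child.len++] = city;
            child.g = parent.g + c[last * N + city];
            if (child.len == N) {           /* complete tour: close it */
                child.g += c[city * N];
                child.vector[N] = 0;
                child.cost = child.g;
                if (child.cost < best->cost)
                    *best = child;
            } else {
                child.cost = child.g + estimate_h(c, &child);
                if (child.cost < best->cost)
                    open_insert(&child);
            }
        }
    }
    return best->cost;
}
```

The loop condition mirrors the pseudocode: the search stops as soon as the front of OPEN can no longer improve on BEST, which is safe because the estimate never overstates the cost of completing a partial tour.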

3.6.2  Decomposition Techniques  In determining the high level design, the first thing to
consider  is  how  the  problem  is  to  be  decomposed  onto  the  parallel  computer.  According  to  Ragsdale, 
two  of  the  main  decomposition  techniques  are  data  and  control  decomposition. 

Data  decomposition  is  where  every  processor  has  the  same  task  and  operates  on  different  data 
sets.  Information  may  or  may  not  be  passed  between  the  processors  as  the  programs  are  executed. 
Ragsdale provides three examples where data decomposition is well suited:

•  Problems where the data is static. Examples include matrix operations or finite difference
calculation on a mesh.

•  Problems where the data structure is dynamic, but is somehow tied to a single entity.
Examples include large multi-part problems with easily generated sub-problems.

•  Problems where the domain is fixed, but the computation within the various regions of the
domain is dynamic. For example, the search space of an NP-complete problem is bounded (no
matter how large), but areas of the search graph can be dynamically generated for exploration

[Ragsdale, 1990: -t. l - l.-a].

Control, or functional, decomposition focuses on the flow of control in an algorithm. Ragsdale
lists two types of control decomposition. The first, functional decomposition, looks at a problem
as a set of operations or functions. The functions are divided up and put on separate processors.
Data which requires a particular function must be sent to the processor which has that function.

The other control decomposition technique is called Worker/Manager. One process is the
“manager” and farms out tasks to the “worker” processors. The manager keeps track of the work
to be done and assigns the work to the workers as they become idle [Ragsdale, 1990: 4.5].
For the first design, the Worker/Manager decomposition is used to control the overall flow
of the algorithm and a data decomposition is used on the worker nodes. This allows the use of
the centralized list for load balancing as discussed in Chapter II. Using data decomposition on the
worker processors allows the large data sets to be manipulated without having to constantly pass
information between processors. This allows different branches of the search tree to be explored
simultaneously.

3.6.3  High  Level  Algorithms  Ragsdale  suggests  the  following  general  approach  to  designing 
parallel algorithms using data decomposition:

•  Distribute  the  data 

•  Restrict  the  computation  so  that  each  processor  updates  its  own  data 

•  Put  in  the  communication 


[Ragsdale,  1990;  .o.l]. 

Using the sequential algorithm as a starting point, two algorithms are developed. The first
algorithm is the manager which distributes the data and controls the overall flow of the algorithm.
The second algorithm is the worker and performs the actual computations required to execute the
A* algorithm. The third step in the design process, communication, is accomplished in both the
worker and manager algorithms. The designs developed are very similar to the sequential TSP
design with the functions divided between the Control and Worker algorithms.

The  high  level  Control  design  is: 


High Level TSP Control Design
    Receive cost matrix from host
    Generate starting node in search tree
    While (nodes still left on OPEN)
        Send work to idle Workers
    End While (nodes still left on OPEN)
    Terminate Workers
    Collect results
END High Level TSP Control Design


The Control routines are very simple, well known types of routines and so little time will be
spent discussing them. The Control algorithm contains most of the initialization and termination
routines. The only Control routines that are not used in the sequential TSP algorithm are “send
work to idle Workers” and “terminate Workers”. These routines are peculiar to a parallel
implementation of the algorithm and provide control information to the workers. On a serial computer,
the control is provided by the sequential nature of the algorithm. Since the statements are
executed sequentially, there is no conflict over which statement is executed next or when the program
terminates.

The  high  level  Worker  design  is: 


High Level TSP Worker Design
    Receive cost matrix from host
    While (not terminated)
        Request and receive work from Control
        Perform A* search
        Broadcast solution if better than current best solution
        Send local OPEN list to Control
        Send results to Control
    End While (not terminated)
END High Level TSP Worker Design


The heart of the Worker algorithm is the “Perform A* search” routine. This routine is
the sequential version of the TSP A* algorithm executed on multiple processors. The routines
which determine the efficiency and speedup of the parallel algorithm are “Request and receive
work from Control”, “Broadcast solution”, and “Send local OPEN list”. These routines, along
with their corresponding routines on Control, control when and how information is passed between
processors. These routines are further discussed in low level design.
3.7  Summary


This chapter presented the methodology used to attack the research, the metrics used to
evaluate the efficiency of the algorithms developed, and the high level design. Also discussed was
a more detailed explanation of the TSP. The preliminary design was partitioned
into two separate algorithms based upon the “worker/manager” concept of control decomposition.
The worker algorithm was further refined using data decomposition. The next chapter provides
detailed design of the algorithms along with the data structures and functions used to implement
the main algorithms.


IV.  Low Level Design and Implementation


4.1  Introduction

In this chapter, the data structures used in the programs, the routines which comprise the
programs, and the rationale behind the decisions to use each routine or structure are discussed.
Diagrams showing the relationship between programs and routines are provided in Appendix A.

4.2  Data Structures

The basis for most of the data stored in these programs is the following C language structure:


typedef struct {
    int vector[VECTOR_SIZE + 1];
    int cost;
    int link;
} NODE;


The type NODE has three fields, each of which is of type integer. The first field, vector, is
an array of size VECTOR_SIZE. VECTOR_SIZE equals the number of cities in the problem. The
order of the cities in NODE.vector is the order in which the cities are visited. The second field in
NODE, cost, contains the cost of the partial or complete tour stored in vector. The final field, link,
is used as a pointer to the next NODE when NODEs are stored on the OPEN list in a queue.


VECTOR                                  COST       LINK
Array of integers, number of elements   Integer    Integer
equals number of cities

Figure 4.1. Structure of Type NODE


Another data structure is the queue used to store nodes waiting to be expanded. The queue
is an array of NODEs linked using the NODE.link field to determine which element in the array is
next on the queue. Also in the array is a list of elements which have no data and are considered
free elements. Initially, all the elements are on the free list. Associated with the queue are pointers
to the front of the queue and free lists with variables to show the queue status and queue length.
NODEs are inserted into the queue using an insertion sort so that the NODE with the smallest
cost is at the front of the queue. The free list is kept as a last-in-first-out queue.
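The queue and free list sharing one array can be sketched in C as follows. This is an illustrative sketch, not the implementation used in the programs: the element type `QNODE` omits the vector field for brevity, and the names `QUEUE`, `queue_init`, `queue_insert`, `queue_remove`, `QUEUE_SIZE`, and `NIL` are assumptions. The names `q_front`, `freeptr`, and `q_length` follow the variables described in this chapter.

```c
#include <assert.h>

#define QUEUE_SIZE 10   /* hypothetical array size */
#define NIL (-1)        /* marks the end of a linked chain */

typedef struct {
    int cost;
    int link;           /* index of the next element, or NIL */
} QNODE;                /* vector field omitted in this sketch */

typedef struct {
    QNODE elem[QUEUE_SIZE];
    int q_front;        /* index of the lowest cost element */
    int freeptr;        /* index of the first free element */
    int q_length;       /* number of elements on the queue */
} QUEUE;

/* Initially every element is on the free list. */
void queue_init(QUEUE *q)
{
    for (int i = 0; i < QUEUE_SIZE; i++)
        q->elem[i].link = (i + 1 < QUEUE_SIZE) ? i + 1 : NIL;
    q->freeptr = 0;
    q->q_front = NIL;
    q->q_length = 0;
}

/* Insert in ascending cost order. Returns 0, or -1 if the array is full. */
int queue_insert(QUEUE *q, int cost)
{
    int slot = q->freeptr;
    int prev = NIL;
    int cur = q->q_front;

    if (slot == NIL)
        return -1;
    q->freeptr = q->elem[slot].link;        /* take slot off the free list */
    q->elem[slot].cost = cost;

    while (cur != NIL && q->elem[cur].cost <= cost) {
        prev = cur;                         /* walk to the insertion point */
        cur = q->elem[cur].link;
    }
    q->elem[slot].link = cur;
    if (prev == NIL)
        q->q_front = slot;
    else
        q->elem[prev].link = slot;
    q->q_length++;
    return 0;
}

/* Remove the front element and return its cost, or -1 if empty.
 * Costs are assumed to be nonnegative. */
int queue_remove(QUEUE *q)
{
    int slot = q->q_front;
    int cost;

    if (slot == NIL)
        return -1;
    cost = q->elem[slot].cost;
    q->q_front = q->elem[slot].link;
    q->elem[slot].link = q->freeptr;        /* push slot on the free list */
    q->freeptr = slot;
    q->q_length--;
    return cost;
}
```

Because links are array indices rather than memory addresses, the whole array can be copied to another processor and the chains remain valid there.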

An array was used to implement the OPEN queue for two reasons. The main reason is that
the hypercube is a distributed memory architecture and each processor has its own distinct memory.
This architecture does not allow information passing using a linked list since the pointers to memory
locations have no meaning on a different processor. Therefore, if a linked list is used, the queue
on each Worker must be put into an array before transmitting it to the centralized list kept on
the Control processor. The other reason is that while the OPEN queue can become unmanageably
large for A* search problems, the manner in which the heuristic estimate for h(n) is calculated is
relatively accurate and keeps the OPEN list from becoming very large. This allows the array to be
of a manageable size, around 9,000 elements, and still be large enough to contain the OPEN list.

In Figure 4.2, the array is comprised of 10 elements, six on the queue and four on the free
list. The front of the queue is pointed to by q_front and the front of the free list by freeptr. The
status is given by q_status as busy and the q_length is six.

After NEW_NODE is generated and inserted into the queue, the array and associated variables
are as shown in Figure 4.3. Removing a NODE from the queue results in the configuration shown
in Figure 4.4.
The other data structure is the matrix used to store the cost of traveling between cities. This
is a square matrix with the row/column lengths equal to the number of cities in the problem. Each
element of the matrix is an integer value. For example, in Figure 4.5 the cost of going from city
5 to city 3 is 73 while the cost of going from city 3 to city 5 is 87. Notice the cost matrix is not
symmetric. Changes to the data structures are discussed later in the chapter as appropriate for
algorithm changes.

i  Loir  I. err  I  Desiijii 

The low level design provides more details of the program and considers the architecture of
the computer on which it will be running. Since these programs will run on the iPSC/2 hypercube,
its message passing protocol and communication time must be considered. Also, the hypercube
does not have an efficient interface between the user and processors. For this reason an additional
algorithm, Host, is run on the host processor to provide the user interface.

This section also describes in greater detail the high level design developed in Chapter III.
The Control program is discussed first, then the Worker program, and finally the functions used to
implement specific actions of the programs. For each program, a pseudo code program is given.

4.3.1 Random City Generator  A random number generator is used to build the cost matrix
which is then stored in a file to be read into the main program later. This was done to allow repeated
testing of the same problem using different size and possibly different types of hypercubes. The
first element stored in the file is the number of cities in the problem. The random number generator
assigns values from 0 to 99 for the costs to travel between cities.
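
As an illustration only, the generator described above might be sketched in C as follows. The names build_cost_matrix and write_cost_file, the MAX_CITIES bound, the MAX_COST limit, and the explicit seed parameter are assumptions made for this sketch and are not taken from the thesis code. The text does not say how the diagonal of the matrix is treated, so the sketch simply fills every entry.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_CITIES 32   /* assumed upper bound on problem size */
#define MAX_COST   99   /* assumed largest travel cost */

/* Fill an n x n cost matrix with pseudo-random costs in 0..MAX_COST. */
void build_cost_matrix(int n, int cost[MAX_CITIES][MAX_CITIES], unsigned seed)
{
    srand(seed);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            cost[i][j] = rand() % (MAX_COST + 1);
}

/* Write the number of cities first, then the matrix row by row.
 * Returns 0 on success and -1 if any write fails. */
int write_cost_file(FILE *fp, int n, int cost[MAX_CITIES][MAX_CITIES])
{
    if (fprintf(fp, "%d\n", n) < 0)
        return -1;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            if (fprintf(fp, "%d ", cost[i][j]) < 0)
                return -1;
        if (fprintf(fp, "\n") < 0)
            return -1;
    }
    return 0;
}
```

Because cost[i][j] and cost[j][i] are drawn independently, the resulting matrix is in general not symmetric, which matches the example in Figure 4.5. Storing the matrix in a file, rather than regenerating it, is what allows the same problem to be rerun on different hypercube configurations.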

4.3.2 Control Program  As developed in Chapter III, the Control high level design is as
follows.


High Level TSP Control Design
    Build cost matrix
    Perform depth first search on one node to determine initial BEST.cost
    Generate starting node in search tree
    While nodes still left on OPEN or any Worker BUSY
        Send work to idle Workers
    Terminate Workers
    Collect results
END High Level Control Design


The high level Control design is further developed by adding communication requirements
and more specific details to the algorithm and is shown below:


Low Level TSP Control Design
    Receive cost matrix from host
    Perform depth first search on one node to determine initial BEST.cost
    Generate starting node in search tree
    While ((OPEN not empty) or (any Worker busy)) loop
        While (WORK_REQUEST message from Workers) loop
            Receive WORK_REQUEST message
            Identify Worker which sent message
            Set appropriate node_status to available
        End While (WORK_REQUEST from Workers) loop
        While ((NEW_NODE message from Workers) and (OPEN not full)) loop
            Receive NEW_NODE message from Worker
            Insert NODE into OPEN
        End While ((NEW_NODE from Workers) and (OPEN not full))
        While ((OPEN not empty) and (Worker is available) and
               (front NODE on OPEN cost < best.cost)) loop
            Delete NODE from front of OPEN
            Send NODE to Worker for expansion
        End While ((OPEN not empty) and (Worker is available))
        If (NEW_BEST message from Worker)
            Receive NEW_BEST message from Worker
            If (NEW_BEST.cost < current best.cost)
                current best = NEW_BEST
                Prune OPEN list of NODEs with costs >= best.cost
            End If (NEW_BEST.cost < current best.cost)
        End If (NEW_BEST from Worker)
    End While ((OPEN not empty) or (any Worker busy)) loop
    Send DONE message to all Workers
    Collect results from Workers
END Low Level Control Design


As long as the OPEN list is not empty or the Workers are not all idle, the Control processor
continually polls the receive buffers of the iPSC/2 for a message from the Worker processors and
takes appropriate action when a message is received. For example, if a NEW_BEST message is
received, the cost is compared to the current BEST.cost and, if it is lower, the OPEN list is pruned of
unnecessary NODEs. When the while loop is exited, a terminate message is sent to all Workers.
Finally, data is collected from the Workers.

If more than one Worker is available to send work to, the algorithm selects the processor with
the lowest number. For example, if processors 3, 4, and 5 are all available, processor 3 and then
processor 4 receive work. If processor 3 requests work again before processor 5 receives work, it
will receive the work before processor 5. This skews the efficiency of the individual processors so
lower number processors have higher efficiencies.
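
The selection rule can be sketched in C as below. The function name pick_worker and the available[] flag array are hypothetical stand-ins for the Worker status bookkeeping kept by the Control processor; they are not names from the thesis code.

```c
#include <stdbool.h>

/* Return the lowest-numbered Worker whose flag is set,
 * or -1 when no Worker is currently available. */
int pick_worker(const bool available[], int num_workers)
{
    for (int w = 0; w < num_workers; w++)
        if (available[w])
            return w;
    return -1;
}
```

Because the scan always restarts at Worker 0, a low-numbered Worker that asks for work again is served ahead of higher-numbered Workers that are still waiting, which is the source of the skew in per-processor efficiency noted above.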

Terminating a centralized list A* algorithm requires that the Worker processors be idle and
the OPEN list be empty. Idle Workers are determined by the Worker Busy variable. When work
is sent to a Worker, busy is set to "true", and set "false" when a work request is received by the
Control processor. The OPEN list is checked at the beginning of each loop through the algorithm
to ensure it has valid work. If either the OPEN list is not empty or any Worker processor is not
idle, the algorithm continues.

4.3.3 Worker Program  The high level design from Chapter 3 for the Worker is:

High Level TSP Worker Design
    Build cost matrix
    While (not terminated)
        Request and receive work from Control
        Perform A* search
        Broadcast solution if better than current best solution
        Send local OPEN list to Control
        Send results to Control
    End While (not terminated)
END High Level TSP Worker Design


Like  the  Control  algorithm,  the  Worker  algorithm  is  further  developed  in  the  low  level  design. 


Low Level TSP Worker Design
    Receive cost matrix
    While (not terminated by DONE message)
        Send WORK_REQUEST message to Controller
        Receive EXPAND_NODE message from Controller
        Loop until all cities have been checked
            Add a city to end of cities which have been visited
            If new city has not been visited in this partial tour
                Calculate the cost (h(n) and f(n)) for new partial tour
                If (new NODE.cost < BEST.cost) and (NODE.vector is a tour)
                    BEST = NODE
                    Broadcast NEW_BEST to all processors
                If (NODE.cost < BEST.cost) and (NODE.vector is not a tour)
                    Insert new NODE into OPEN
        END (Loop until all cities have been checked)
    END (Loop while (not terminated by DONE message))
    Send results to Controller or Host
END Low Level TSP Worker Design


This algorithm is very similar to the sequential TSP developed in Chapter 3. See Figure 3.10
for the detailed explanation and for an example of this algorithm. The main differences from the
sequential TSP algorithm are the communication between processors and the fact that algorithm
control is provided by the Control algorithm.

4.3.4 Host Program  An algorithm to provide interface with the user is provided by the
Host program which runs on the iPSC/2 system resource manager (SRM), or host processor. This
algorithm prompts the user for information needed to run the TSP program, performs initialization,
loads the Control and Worker processors, and displays final results. The design for this algorithm
is:


Low Level TSP Host Design
    Print initial messages
    Request and receive file name where cost matrix is stored
    Copy cost matrix
    Load Control and Worker programs
    Send cost matrix to Control and Worker programs
    Receive results from Control and Worker programs
    Print results and termination messages
END Low Level TSP Host Design

4.3.5  Subroutines  Much  of  the  work  in  the  Control  and  Worker  main  programs  is  performed 
using  calls  to  subroutines  or  functions.  This  section  discusses  the  functions  used  in  both  Control 
and  Worker  programs. 

The most important function is the assignment problem used to calculate h(n). This
algorithm is discussed in detail in Chapter III. Another function, Tour, traces through each city of
NODE.vector to see if the path goes through each city only once and ends at city 1. It returns a
boolean flag stating whether NODE.vector is a tour. The algorithm is:


Tour
    Initialize test array of size num_cities to 0
    Set Tour flag to FALSE
    Mark city 1 as visited in test array
    While (city not visited twice) loop
        Go to next city in NODE.vector
        Mark city as visited in test array
    End While (city not visited twice)
    If (all cities visited once) and (end at city 1)
        Set Tour to TRUE
    End If (all cities visited once) and (end at city 1)
END Tour
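
A minimal C sketch of the Tour test is given below. It assumes, purely for illustration, that the vector is stored as a successor array in which succ[c] is the city visited immediately after city c, with cities numbered from 1; the thesis does not state the layout of NODE.vector here, and the names is_tour, succ, and MAX_CITIES are hypothetical.

```c
#include <stdbool.h>

#define MAX_CITIES 32   /* assumed upper bound on problem size */

/* succ[c] is assumed to hold the city visited right after city c,
 * for c in 1..num_cities (index 0 is unused). */
bool is_tour(const int succ[], int num_cities)
{
    int visited[MAX_CITIES + 1] = {0};   /* the "test array" */
    int city = 1;
    int count = 0;

    if (num_cities < 1 || num_cities > MAX_CITIES)
        return false;

    /* Follow successors from city 1 until some city is reached twice. */
    while (visited[city] == 0) {
        visited[city] = 1;
        count++;
        city = succ[city];
        if (city < 1 || city > num_cities)
            return false;                /* malformed vector */
    }

    /* A tour returns to city 1 after touching every city exactly once. */
    return city == 1 && count == num_cities;
}
```

With three cities, the successor array {0, 2, 3, 1} describes the cycle 1, 2, 3, 1 and is accepted, while a vector that closes early on a subtour such as 1, 2, 1 is rejected because not every city was reached.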


The function to determine if the city added to the end of NODE.vector is a feasible selection
is In_path. This function traces through NODE.vector to determine if the city has already been
visited in this partial tour. It returns a boolean flag stating whether the city has already been
visited. The algorithm is:

In_path
    Set In_path flag to FALSE
    While (cities not visited in NODE.vector) loop
        If (city in NODE.vector = city being added)
            Set In_path flag to TRUE
        End If (city in NODE.vector = city being added)
    End While (cities unvisited in NODE.vector)
END In_path
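
In C the In_path check reduces to a linear scan. The sketch below assumes the partial tour is held as a plain sequence of len city numbers; the parameter names are hypothetical.

```c
#include <stdbool.h>

/* Return true if city already appears among the first len
 * entries of the partial tour stored in path[]. */
bool in_path(const int path[], int len, int city)
{
    for (int i = 0; i < len; i++)
        if (path[i] == city)
            return true;
    return false;
}
```

The scan costs at most one comparison per city already in the partial tour, which is small next to the cost of evaluating h(n) for each new NODE.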


Because the C language does not support direct copying of arrays, a function called Copy_node
was made. This function copies the array stored in NODE.vector to another variable of type NODE.
The other fields in the NODE structure are also explicitly copied. The algorithm is:


Copy_node
    While (unvisited elements in NODE_1.vector) loop
        NODE_2.vector = NODE_1.vector
    End While (unvisited elements in NODE_1.vector)
    NODE_2.cost = NODE_1.cost
END Copy_node
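
A possible C rendering of Copy_node follows. The NODE layout shown (a vector, a cost, and a link field) is inferred from the fields mentioned in this chapter, and MAX_CITIES and the parameter names are assumptions of the sketch.

```c
#define MAX_CITIES 32   /* assumed upper bound on problem size */

typedef struct {
    int vector[MAX_CITIES];  /* partial tour */
    int cost;                /* cost estimate for this NODE */
    int link;                /* index of the next element in the OPEN array */
} NODE;

/* Copy the first num_cities entries of src->vector, then the scalar fields. */
void copy_node(const NODE *src, NODE *dst, int num_cities)
{
    for (int i = 0; i < num_cities; i++)
        dst->vector[i] = src->vector[i];
    dst->cost = src->cost;
    dst->link = src->link;
}
```

Copying only num_cities entries avoids touching unused slots of the vector. Standard C also permits whole-structure assignment (*dst = *src), which copies an embedded array as well; the element-wise form above simply mirrors the pseudo code.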


Unlike a sequential algorithm, several processors could locate solutions which are better than
the current best solution and broadcast them at relatively the same time. The broadcasted message
is d.nO bytes and according to Bomans and Roose, this size message takes approximately 800
microseconds to transmit. Because of the near simultaneous broadcasts and communication delays, a
processor could receive a NEW_BEST message which was higher than the current BEST.cost stored
at that processor [Bomans and Roose, 1989: 16]. To ensure the best solution is stored in BEST,
every NEW_BEST message is compared against BEST.cost and the smaller value is returned. The
algorithm to perform this is Get_best and is:


Get_best
    While (receiving NEW_BEST messages) loop
        If (NEW_BEST.cost < BEST.cost)
            BEST = NEW_BEST
        End If (NEW_BEST.cost < BEST.cost)
    End While (NEW_BEST message)
END Get_best
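
The comparison in Get_best can be sketched in C as follows. To keep the example self-contained, the pending NEW_BEST messages are passed in as an array rather than read from the iPSC/2 receive buffers, and the SOLUTION type carries only a cost; both simplifications, and the names used, are assumptions of the sketch.

```c
typedef struct {
    int cost;   /* tour cost; the tour vector itself is omitted here */
} SOLUTION;

/* Fold a batch of received NEW_BEST messages into the local best.
 * Returns how many of the messages actually improved on it. */
int get_best(SOLUTION *best, const SOLUTION received[], int count)
{
    int improved = 0;
    for (int i = 0; i < count; i++) {
        if (received[i].cost < best->cost) {
            *best = received[i];
            improved++;
        }
    }
    return improved;
}
```

A late message carrying a higher cost than the stored BEST is simply ignored, so the order in which near-simultaneous broadcasts arrive does not affect the final value.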

The OPEN list is stored in an array of NODEs. Four functions control all actions associated
with data manipulation of the OPEN queue. The first function, q_init, initializes all elements in
the array including the NODE.vector array within each element. It also initializes the link and cost
fields of NODE in each element of OPEN. Finally, all other variables associated with the OPEN
queue are initialized. The algorithm for q_init is:


q_init
    For (all the elements on OPEN)
        For (all the elements in NODE.vector)
            Set NODE.vector field to 0
        End For (all the elements in NODE.vector)
        Cost = INFINITY
        Link = next element of OPEN
    End For (all the elements on OPEN)
    Set link field of last element on OPEN to (end of file marker)
    q_status = EMPTY
    q_length = 0
    Pointer to free list = 0
    Pointer to OPEN list = (end of file marker)
END q_init


The NODEs are deleted from OPEN by the algorithm delete_q. This algorithm deletes the
NODEs, changes the queue length variable, and adjusts the pointers to the front of the OPEN
queue and free list. Error checking is also performed to provide a warning message if the algorithm
attempts to delete a NODE from an empty queue. Finally, the status of the queue is checked and
changed as needed. The algorithm is:


delete_q
    If the queue is empty print warning message
    If the queue is not empty
        Decrement queue length by 1
        Remove NODE from the front of the queue
        Adjust pointer to the front of the queue to point to the new front
        Adjust pointer to the front of the free list to point to the new front
        Adjust the queue status as appropriate
    end (If the queue is not empty)
END delete_q


The third function which manipulates the queue is insert_priority. This algorithm performs
an insertion sort of the NODEs into the OPEN queue, changes the queue length variable, and
adjusts the pointers to the front of the OPEN queue and free list. This algorithm also performs
error checking and updates the queue status. The algorithm is:


insert_priority
    If the queue is full print warning message
    If the queue is not full
        Increment queue length by 1
        Insertion sort the NODE into the OPEN queue
        Adjust pointer to the front of the queue to point to the new front
        Adjust pointer to the front of the free list to point to the new front
        Adjust the queue status as appropriate
    end (If the queue is not full)
END insert_priority
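
The three queue functions described so far can be illustrated with the following C sketch of an array-backed priority queue that threads both the OPEN queue and the free list through the link field. It is a simplified reading of the pseudo code, not the thesis implementation: the NODE here carries only a cost, the capacity Q_SIZE is tiny, NIL stands in for the end marker, and the queue status is left implicit in q_length. All of these choices are assumptions.

```c
#include <limits.h>
#include <stdio.h>

#define Q_SIZE 8      /* assumed capacity; the text uses about 9,000 */
#define NIL    (-1)   /* stands in for the end marker */

typedef struct { int cost; int link; } NODE;   /* NODE.vector omitted */

typedef struct {
    NODE elem[Q_SIZE];
    int  q_front;    /* index of the cheapest queued NODE, or NIL */
    int  freeptr;    /* index of the first free slot, or NIL */
    int  q_length;
} QUEUE;

/* Chain every slot onto the free list and mark the queue empty. */
void q_init(QUEUE *q)
{
    for (int i = 0; i < Q_SIZE; i++) {
        q->elem[i].cost = INT_MAX;
        q->elem[i].link = i + 1;
    }
    q->elem[Q_SIZE - 1].link = NIL;
    q->q_front  = NIL;
    q->freeptr  = 0;
    q->q_length = 0;
}

/* Insert in ascending cost order. Returns 0 on success, -1 if full. */
int insert_priority(QUEUE *q, int cost)
{
    if (q->freeptr == NIL) {
        fprintf(stderr, "warning: OPEN queue is full\n");
        return -1;
    }
    int slot = q->freeptr;                 /* take a slot off the free list */
    q->freeptr = q->elem[slot].link;
    q->elem[slot].cost = cost;

    int prev = NIL, cur = q->q_front;      /* stop at first cost >= new cost */
    while (cur != NIL && q->elem[cur].cost < cost) {
        prev = cur;
        cur = q->elem[cur].link;
    }
    q->elem[slot].link = cur;
    if (prev == NIL)
        q->q_front = slot;
    else
        q->elem[prev].link = slot;
    q->q_length++;
    return 0;
}

/* Remove the front NODE and store its cost in *cost.
 * Returns 0 on success, -1 if the queue is empty. */
int delete_q(QUEUE *q, int *cost)
{
    if (q->q_front == NIL) {
        fprintf(stderr, "warning: delete from empty OPEN queue\n");
        return -1;
    }
    int slot = q->q_front;
    *cost = q->elem[slot].cost;
    q->q_front = q->elem[slot].link;       /* new front of the queue */
    q->elem[slot].cost = INT_MAX;
    q->elem[slot].link = q->freeptr;       /* slot becomes front of free list */
    q->freeptr = slot;
    q->q_length--;
    return 0;
}
```

Because links are array indices rather than memory addresses, the whole structure can be sent to another processor unchanged, which is the reason given earlier in the chapter for preferring an array to a pointer-based linked list.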


The final function to work with the queue is prune_q. After a NEW_BEST solution is found,
this function is used to bound the search by removing, or pruning, states from the search space
tree. This is done by removing from OPEN all NODEs which have a cost greater than or equal
to the cost of the new solution. Notice that NODEs with equal cost are also eliminated since the
algorithm only searches for a best solution, not all best solutions. The algorithm is:


prune_q
    Traverse the OPEN queue until BEST.cost <= NODE.cost
    Delete all NODEs in OPEN beyond the current NODE
    Adjust pointers in the free list
    Adjust the queue status as appropriate
END prune_q
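
Since OPEN is kept in ascending cost order, pruning amounts to cutting off the tail of the list. The C sketch below shows one way to do this on an array-backed list with a free list; the QUEUE layout, the NIL marker, and the return value are assumptions made for the example.

```c
#define Q_SIZE 8      /* assumed capacity */
#define NIL    (-1)   /* stands in for the end marker */

typedef struct { int cost; int link; } NODE;   /* NODE.vector omitted */

typedef struct {
    NODE elem[Q_SIZE];
    int  q_front;    /* index of the cheapest queued NODE, or NIL */
    int  freeptr;    /* index of the first free slot, or NIL */
    int  q_length;
} QUEUE;

/* Remove every NODE whose cost is >= best_cost.
 * Returns the number of NODEs removed. */
int prune_q(QUEUE *q, int best_cost)
{
    int prev = NIL, cur = q->q_front;

    /* Skip the NODEs that could still lead to a better solution. */
    while (cur != NIL && q->elem[cur].cost < best_cost) {
        prev = cur;
        cur = q->elem[cur].link;
    }

    /* Detach the tail from the queue. */
    if (prev == NIL)
        q->q_front = NIL;
    else
        q->elem[prev].link = NIL;

    /* Hand each slot in the tail back to the free list. */
    int removed = 0;
    while (cur != NIL) {
        int next = q->elem[cur].link;
        q->elem[cur].link = q->freeptr;
        q->freeptr = cur;
        cur = next;
        removed++;
    }
    q->q_length -= removed;
    return removed;
}
```

The traversal stops at the first NODE that cannot beat the new solution, so the work done is proportional to the queue length and no re-sorting is needed.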

4.4 Distributed List


Quinn theoretically proved and Abdelrahman and Mudge demonstrated that as the number
of processors increased, the communication to/from the master processor allocating work to slave
processors eventually becomes a bottleneck. To eliminate this bottleneck, distributed list algorithms
are used [Quinn, 1990: 385] [Abdelrahman and Mudge, 1988: 1496-1498].

In this section, the modifications required to change the parallel TSP control from a
centralized list to a distributed list (DL) are discussed. This entails changing the high level decision of
using the functional worker/manager decomposition and using only data decomposition on all
processors. This eliminates the Control processor, but adds another Worker processor. The global
OPEN queue is now maintained in a local OPEN queue on each processor. Extra communication
between processors is required to perform the tasks previously done by the Control processor such
as load balancing and program termination. This section looks at implementing distributed list
queues with and without load balancing.

4.4.1 Without Load Balancing  There are two main methods used to implement a
distributed list queue. The first method generates and distributes an initial work load and then no
load balancing, or work sharing, is performed. Work which is generated by a processor stays on
that processor. When a processor finishes its work, it remains idle until all processors finish. If the
work generated by the processors is not approximately equal, processors may be idle for a relatively
long period of time waiting for all the processors to finish [Ma and others, 1988: 1509-1511].

The new design for distributed list TSP without load balancing is:

TSP Worker Without Load Balancing Design
    Receive cost matrix from host
    Perform depth first search on one node to determine initial BEST.cost
    Generate starting node in search tree
    Distribute descendants of initial node among all processors
    While (not terminated by DONE message)
        While (OPEN not EMPTY)
            Remove NODE from OPEN queue
            Loop until all cities have been checked
                Add a city to end of cities in NODE.vector which have been visited
                If new city has not been visited in this partial tour
                    Calculate the cost (h(n) and f) for new partial tour
                    If (new NODE.cost < BEST.cost) and (NODE.vector is a tour)
                        BEST = NODE
                        Broadcast NEW_BEST to all processors
                    If (NODE.cost < BEST.cost) and (NODE.vector is not a tour)
                        Insert new NODE into OPEN
            END (Loop until all cities have been checked)
        END While (OPEN not EMPTY)
        Terminate the processors
    END Loop while (not terminated by DONE message)
    Send results to Host
END TSP Worker Without Load Balancing Design


Again, this is just the sequential TSP algorithm with communication and control to allow the
parallel operation. After the children of the initial node are generated, they are distributed equally
among the processors. Each processor then performs the sequential TSP until all of the NODEs
on its local OPEN have been explored.

To terminate the parallel program, all processors must be idle. This is determined by sending
a message, RING, which is only received when the processor is idle. The body of the RING message
is empty and just the fact that the message was received is significant. The processor identified
by the hypercube operating system as node 0 initiates RING when its OPEN queue is empty and
sends it to the next logically numbered processor. Once a processor is idle, it receives the RING
and passes it to the next processor. When processor 0 receives the RING again, all processors are
idle and a DONE message can be sent terminating all the processors.
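
The behaviour of the RING can be illustrated with a small timing model in C. The model is hypothetical: idle_at[i] is the time at which processor i empties its OPEN queue, hop is the time for one RING transfer, and a processor never becomes busy again once idle, which holds here because no work is shared.

```c
/* Return the time at which processor 0 receives the RING back
 * and can therefore send DONE.
 * idle_at[i]: time at which processor i becomes idle (model input).
 * hop:        time for the RING to move to the next processor. */
int ring_done_time(const int idle_at[], int num_procs, int hop)
{
    int t = idle_at[0];              /* processor 0 starts the RING when idle */
    for (int i = 1; i < num_procs; i++) {
        t += hop;                    /* RING arrives at processor i */
        if (t < idle_at[i])
            t = idle_at[i];          /* it waits unread until i is idle */
    }
    return t + hop;                  /* final transfer back to processor 0 */
}
```

In this model the RING is delayed only by processors that are still working when it arrives, so the DONE message is sent one circuit of message hops after the slowest processor finishes, at the latest.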

4.4.2 DL With Load Balancing  The main deficiency of the DL without load balancing
algorithm is that work could be unevenly distributed and some processors are idle while others still
have large amounts of work to perform. As Ma and others show, the efficiency of distributed lists
without load balancing can be very low [Ma and others, 1988: 1509-1511]. The obvious solution
is to have an idle processor request work from a busy processor. The idle processor first requests
work from its nearest neighbors and then requests work from all other processors one at a time.
Once an idle processor receives work, no more requests for work are issued. Each busy processor
must periodically check its receive buffers to see if it has received a work request. The only way a
processor can remain idle is if all processors are either idle or do not have enough work to share.
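
On a hypercube numbered in the usual binary fashion, the nearest neighbors of a node are the nodes whose numbers differ from it in exactly one bit. The C sketch below builds a request order on that basis: neighbors first, then every other processor. The function name request_order and the choice to visit the remaining processors in ascending node number are assumptions of the sketch; the thesis does not state the order used.

```c
/* Fill order[] with the processors an idle node would ask for work:
 * nearest neighbors first, then all remaining processors.
 * dim is the hypercube dimension (2^dim processors in total).
 * order[] must have room for 2^dim - 1 entries.
 * Returns the number of entries written. */
int request_order(int me, int dim, int order[])
{
    int num_procs = 1 << dim;
    int n = 0;

    /* Nearest neighbors differ from me in exactly one bit. */
    for (int d = 0; d < dim; d++)
        order[n++] = me ^ (1 << d);

    /* Remaining processors differ from me in two or more bits. */
    for (int p = 0; p < num_procs; p++) {
        int diff = p ^ me;
        if (diff != 0 && (diff & (diff - 1)) != 0)
            order[n++] = p;
    }
    return n;
}
```

For node 5 of an eight-processor cube, the neighbors are nodes 4, 7, and 1, and the remaining four processors follow. Asking neighbors first keeps most work transfers to a single communication link.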

Felten, Ma, Penky and Miller, Cvetanovic and Nofsinger, and many others discuss when it
is appropriate for a processor to share work. They all agree there is a trade off between sharing
the work to keep a processor from being idle, the communication overhead involved with the work
sharing, and the possibility of a processor sharing too much work and having to immediately
request work itself. They state the measure of when to share, β, is problem specific and must be
determined experimentally [Felten 1988, I50d] [Ma and others, 1988: 1507] [Miller and Penky,
1989: 133] [Cvetanovic and Nofsinger, 1990: 87]. Since nothing was found in the literature to
provide any guidance to the factors which influence β, this research investigates these factors.

According to Felten, there are two requirements to terminate a DL process with load balancing
on a hypercube. The first requirement is the same as with the DL without load balancing: that all
processors be idle. The second is that all messages have been received [Felten, 1988: 1502]. The
requirement to have received all messages ensures no work was distributed to a processor but not
received by that processor. To keep track of the number of messages sent, each processor keeps a
local count of the messages sent/received. A variable is incremented when a message is sent and
decremented when a message is received. To terminate the process, the local message count can be
any number, but the global message count must equal zero. To accomplish this, the RING message
body from the DL without load balancing is changed to a data structure of the form:

typedef struct {
    int message_count;
    int num_done;
} FINISHED;

This termination algorithm differs from Felten's termination in the method used to determine
if all processors are idle. Like the DL without load balancing algorithm, the DL with load balancing
algorithm uses processor 0 to initiate the RING when it becomes idle and send it to the next
logically numbered processor. If a busy processor receives the RING, it just sends it to processor
0 and continues working. When an idle processor receives the RING, it increments the number done
and adds its message count to the total message count. When processor 0 receives the RING, it
checks FINISHED to validate that all processors are finished and all messages received. If either
condition is not met, processor 0 again sends the RING.
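
One circuit of the RING, as described above, can be modelled in C as follows. The function ring_pass and its array arguments are hypothetical; in particular, the sketch assumes processor 0 seeds FINISHED with its own message count and counts itself as done, which the text does not spell out.

```c
#include <stdbool.h>

typedef struct {
    int message_count;
    int num_done;
} FINISHED;

/* Model one circuit of the RING started by an idle processor 0.
 * idle[p]:        whether processor p is idle when the RING arrives.
 * local_count[p]: messages sent minus messages received at p.
 * Returns true when processor 0 may send DONE. */
bool ring_pass(const bool idle[], const int local_count[], int num_procs)
{
    FINISHED ring = { local_count[0], 1 };   /* processor 0 counts itself */

    for (int p = 1; p < num_procs; p++) {
        if (!idle[p])
            break;               /* busy: RING goes straight back to 0 */
        ring.num_done++;
        ring.message_count += local_count[p];
    }
    return ring.num_done == num_procs && ring.message_count == 0;
}
```

A nonzero total means some work message was sent but not yet received, so termination is refused even if every processor reported idle during the circuit.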

Felten's algorithm differs by having all processors, busy or idle, send the RING to the next
processor. When processor 0 receives the RING, it checks to see if any processor is busy. Again,
if either condition is not met, processor 0 again sends the RING. In the DL with load balancing
algorithm, at most one busy processor is interrupted by the RING message. In Felten's algorithm,
at best only one busy processor is interrupted and at worst [(number of nodes) - 1] busy processors
are interrupted [Felten, 1988: 1503].

Beard modified Felten's algorithm for termination and also used it for load balancing. A busy
processor handles the RING similar to the Felten algorithm and sends it to the next processor.
However, when an idle processor receives the RING, it differs from Felten's algorithm in that the
RING now allows the processor to request work. Since only one processor can have the RING at
a time, only one processor can be requesting work. Especially on computers with a large number
of processors, many processors could be idle waiting to request work from busy processors [Beard,
1990: 1.25-1.28]. As implemented in this research, idle processors immediately and independently
request work from their neighbors.

Beard also implemented the RING algorithm as a separate process on each processor and used
context switching between the search algorithm and the RING algorithm [Beard, 1990: 1.25-1.28].
This appears to be an inefficient way to perform termination and load balancing. Both of these
tasks occur only at very specific points in the algorithm and are easily controlled. Also, context
switching requires calls to the operating system which stop and start the different algorithms.
All this incurs overhead not required if the functions are called from the main algorithm without
context switching.

The  DL  with  load  balancing  is: 


TSP Worker With Load Balancing Design
    Receive cost matrix from host
    Perform depth first search on one node to determine initial BEST.cost
    Generate starting node in search tree
    Distribute children of initial node among all processors
    While (not terminated by DONE message)
        While (OPEN not EMPTY)
            Remove NODE from OPEN queue
            Check for RING
            Check for WORK_REQUEST message from another processor
            Loop until all cities have been checked
                Add a city to end of cities in NODE.vector which have been visited
                If new city has not been visited in this partial tour
                    Calculate the cost (h(n) and f) for new partial tour
                    If (new NODE.cost < BEST.cost) and (NODE.vector is a tour)
                        BEST = NODE
                        Broadcast NEW_BEST to all processors
                    If (NODE.cost < BEST.cost) and (NODE.vector is not a tour)
                        Insert new NODE into OPEN
            END (Loop until all cities have been checked)
        END While (OPEN not EMPTY)
        Send WORK_REQUEST to other processors
        Terminate the processors
    END Loop while (not terminated by DONE message)
    Send results to Host
END TSP Worker With Load Balancing Design


The algorithm to terminate the process is as follows:


Terminate
    If (my node number is 0)
        Initialize the RING message
        While (not all processors are idle) or (not all messages received)
            Check for WORK_REQUEST message from another processor
            Send the RING to node 1
            Receive the RING
        END While (not all processors are idle) or (not all messages received)
        Send the DONE message
    END If (my node number is 0)
    If (my node number is not 0)
        Receive the RING
        Modify the num_done and message_count fields of FINISHED
        Send RING to next processor
        Wait for DONE message
    END If (my node number is not 0)
END Terminate


The other functions discussed pertain to sharing work between processors. The first function,
send_work_request, sends a WORK_REQUEST message to other processors and waits for their
response. This function also receives any work sent by another processor in response to this message.
The algorithm is:


send_work_request
    Send WORK_REQUEST message to nearest neighbors
    If (neighbor has work)
        Receive work from neighbor
        Insert work into OPEN list
        Expand the NODEs
    END If (neighbor has work)
    If (neighbor has no work)
        Send WORK_REQUEST message to all other processors
        If (processor has work)
            Receive work from processor
            Insert work into OPEN list
            Expand the NODEs
        END If (processor has work)
    END If (neighbor has no work)
END send_work_request


The last function, share_work, is activated when a WORK_REQUEST message is received.
It determines if the processor has enough work to share, replies with either a TRUE or FALSE
response to the requesting processor, and sends the work if appropriate.

If A is the parameter which determines if there is enough work to share, Jansen and Sijstermans
state that A should be chosen in such a way to keep all the processors busy, but not let
communication overhead dominate the process. If A is too small, a processor could share work and
then immediately have to request work itself. If A is too large, processors could be idle while other
processors have a relatively large OPEN list [Jansen and Sijstermans, 1989: 273].

How many NODEs to share is determined in two ways. First, if the OPEN list is larger than
20, then 10 NODEs are sent to the requesting processor. Requiring the OPEN list to be larger
than 20 developed from experience in running the algorithm. If the OPEN list is smaller than 20 but
larger than a predetermined value, then half of the OPEN list is sent. If the OPEN list is smaller
than the predetermined value, no work is shared. The number used to determine how much work
to share and the predetermined value, β, are problem specific [Cvetanovic and Nofsinger, 1990: 87].

Again, one goal of this research is to provide guidelines when setting these values.

The algorithm is:


share_work
    Receive the WORK_REQUEST message
    If (there is enough work to share)
        Send work to requesting processor
    If (there is not enough work to share)
        Send FALSE response to requesting processor
END share_work
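
The rule for how much work to share can be written as a small C function. The constants 20 and 10 come from the text above; the function name nodes_to_share, the constant names, and the rounding down of half the list are assumptions of this sketch, and beta stands for the predetermined, problem-specific value.

```c
#define BULK_THRESHOLD 20   /* OPEN length above which a fixed batch is sent */
#define BULK_BATCH     10   /* number of NODEs sent in that case */

/* Return how many NODEs a busy processor would send in reply to a
 * WORK_REQUEST, given its OPEN list length and the threshold beta.
 * A result of 0 corresponds to a FALSE response. */
int nodes_to_share(int open_length, int beta)
{
    if (open_length > BULK_THRESHOLD)
        return BULK_BATCH;
    if (open_length > beta)
        return open_length / 2;   /* half the list, rounded down */
    return 0;
}
```

The text does not say what happens when OPEN holds exactly 20 NODEs; in this sketch that case falls under the half-the-list rule and so also yields 10. With beta set to 6, a list of 12 NODEs gives up 6 and a list of 6 NODEs gives up none.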

As discussed in Chapter II, this research investigates two variations on the A* algorithm.
This section provides a more detailed discussion of the algorithms and gives the algorithms.


4.5.1 IDA*  As the example in Chapter II shows, the main difference between A* and IDA*
is that IDA* performs a limited depth first search on the node selected for expansion by the A* portion
of the algorithm.

To implement the parallel IDA* algorithm, changes were made to the centralized list Worker
algorithm. The first change is to build another queue using the same structures and functions as
for OPEN, but used to store the NODEs generated during the depth first search portion of the
algorithm. The new queue is called ida_q and all the ida_q queue functions are named using the
same name as the OPEN queue functions with ida_ added to the front.

The  other  change  to  the  CL  Worker  algorithm  is  to  add  the  control  for  the  depth  first  search 
to  the  main  algorithm.  As  long  as  child  nodes  do  not  exceed  the  cost  of  the  original  parent  node, 
they  are  kept  at  that  processor  for  expansion.  Child  nodes  which  exceed  the  cost  of  the  parent  are 
sent  back  to  the  Control  processor  for  insertion  into  the  OPEN  list. 
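
The routing decision for each child node can be sketched in C as below. The enumeration, the function name route_child, and the handling of children that cannot beat the current best are illustrative assumptions; complete tours, which become the new BEST rather than being queued, are left out of the sketch.

```c
typedef enum {
    KEEP_LOCAL,        /* insert into the Worker's ida_q */
    SEND_TO_CONTROL,   /* return to Control for insertion into OPEN */
    DISCARD            /* cannot improve on the current best solution */
} ROUTE;

/* Decide where a newly generated child node (not a complete tour) goes. */
ROUTE route_child(int child_cost, int parent_cost, int best_cost)
{
    if (child_cost >= best_cost)
        return DISCARD;
    if (child_cost <= parent_cost)
        return KEEP_LOCAL;         /* does not exceed the parent's cost */
    return SEND_TO_CONTROL;        /* exceeds the parent's cost */
}
```

Keeping equal-cost children on the Worker lets the depth first portion of the search proceed without a round trip to the Control processor for each expansion.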

This  differs  from  Korf’s  method  of  discarding  all  generated  information  except  the  threshold 
cost  for  the  next  iteration.  This  was  done  for  two  retisons.  First,  the  algorithm  did  not  have  the 
problem  with  running  out  of  memory  that  Korf  experienced.  The  second  reason  was  this  research 
hoped  to  compare  the  IDA*  and  the  continuous  diffusion  algorithms.  Since  neither  algorithm  is 
optimized  in  ti^rms  of  execution  time  .  only  the  number  of  nodes  expanded  would  be  used  as  a 
metric  for  comparison. 

d  he  IDA*  Control  and  Host  algorithms  are  the  same  as  the  CL  Control  and  Host.  The  ID.'^* 
Worker  algorithm  is: 


Low Level IDA* Worker Design
Receive cost matrix
While (not terminated by DONE message)
    Send WORK_REQUEST message to Controller
    Receive EXPAND_NODE message from Controller
    Insert received E_NODE into ida_q
    While (ida_q not empty) loop
        Expand node on front of ida_q
        Loop until all cities have been checked
            Add a city to end of cities which have been visited
            If new city has not been visited in this partial tour
                Calculate the cost (h(n) and f(n)) for new partial tour
                If (new NODE.cost < BEST.cost) and (NODE.vector is a tour)
                    BEST = NODE
                    Broadcast NEW_BEST to all processors
                If (NODE.cost < BEST.cost) and (NODE.vector is not a tour) and (NODE.cost < E_NODE.cost)
                    Insert new node into ida_q
                If (NODE.cost < BEST.cost) and (NODE.vector is not a tour)
                    Insert new NODE into OPEN
        END Loop until all cities have been checked
    END While (ida_q not empty)
END Loop while (not terminated by DONE message)
Send results to Controller or Host
END IDA* Worker Design
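The decision the IDA* Worker makes for each newly generated child can be sketched as a small C function. This is a hedged illustration, not the thesis code: the enum, the function name route_child, and the parameter names are hypothetical. The sketch follows the prose rule that children which do not exceed the parent cost stay local, so it uses a less-than-or-equal comparison where the pseudocode shows a strict one, and it makes the two destinations mutually exclusive.

```c
#include <assert.h>

/* Where a newly generated child NODE goes in the IDA* worker. */
typedef enum { DISCARD, NEW_BEST, KEEP_LOCAL, SEND_TO_OPEN } Route;

/* child_cost  : f(n) of the new partial tour
 * parent_cost : cost of the E_NODE received from the Controller
 * best_cost   : current global BEST.cost
 * is_tour     : nonzero if the child visits every city */
Route route_child(int child_cost, int parent_cost, int best_cost, int is_tour)
{
    assert(child_cost >= 0 && parent_cost >= 0 && best_cost >= 0);
    if (child_cost >= best_cost)
        return DISCARD;           /* pruned by the current bound */
    if (is_tour)
        return NEW_BEST;          /* complete and cheaper: becomes the new bound */
    if (child_cost <= parent_cost)
        return KEEP_LOCAL;        /* stays on ida_q for the depth first search */
    return SEND_TO_OPEN;          /* returned to the Control processor's OPEN list */
}
```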


4.5.2 TSP with Levels In trying to compare the IDA* algorithm against the centralized list
or distributed list, the IDA* algorithm is at a disadvantage because of the assignment problem used
to calculate the estimated cost to completion f(n). The advantage IDA* has is its ability to perform
depth first searches once a node has been sent to the processor for expansion. Using the assignment
problem to calculate f(n) also provides some depth first search, thus negating any advantage IDA*
had.

To balance out the advantage provided by the assignment problem, the CL and DL algorithms
were changed to force all solutions to be at the same level in the search graph. For example, in a
10 city TSP the solution must have all 10 cities. Normally, only 1 city is added at each level of the
search graph. However, the assignment problem can, in some cases, provide a solution from any
level in the graph. To counteract this, another variable was added to the NODE structure telling
what level in the graph the state is at. The new structure is:


typedef struct {
    int vector[VECTOR_SIZE+1];
    int cost;
    int link;
    int level;
} NODE;

If a solution is found but its level does not equal the number of cities, two things happen.
First, the NODE.cost becomes the new global BEST.cost. Second, the variable which counts the
number of NODEs expanded is incremented until the NODE.level equals the number of cities. This
way the solution is always found at the lowest level of the search graph. Now at least the number
of NODEs expanded can be compared between the IDA* and the other algorithms.
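The level adjustment can be sketched in C as follows. This is a minimal sketch under assumptions: the value of VECTOR_SIZE, the function name charge_levels, and the use of a long counter are hypothetical choices made only for illustration.

```c
#include <assert.h>

#define VECTOR_SIZE 100   /* assumed maximum number of cities */

typedef struct {
    int vector[VECTOR_SIZE + 1];
    int cost;
    int link;
    int level;
} NODE;

/* Charge one expansion per missing level, then leave the NODE on
 * the lowest level of the search graph. Returns the updated
 * expansion counter. */
long charge_levels(NODE *node, int num_cities, long expanded)
{
    assert(node != 0 && node->level <= num_cities);
    while (node->level < num_cities) {
        node->level++;
        expanded++;
    }
    return expanded;
}
```

A solution found at level 7 of a 10 city problem would therefore add 3 to the count of NODEs expanded.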

In  the  CL  algorithm,  only  the  Worker  algorithm  is  modified  to  implement  the  use  of  the 
levels.  The  Control  algorithm  still  only  checks  to  see  if  the  workers  are  idle  and  the  OPEN  list  is 
empty  before  terminating  the  task.  The  DL  algorithms  and  the  IDA*  algorithm  have  basically  the 
same  change  as  the  CL  Worker  algorithm,  so  only  the  CL  Worker  is  shown. 


TSP Worker with Levels Design
Receive cost matrix
While (not terminated by DONE message)
    Send WORK_REQUEST message to Controller
    Receive EXPAND_NODE message from Controller
    Loop until all cities have been checked
        Add a city to end of cities which have been visited
        If new city has not been visited in this partial tour
            Calculate the cost (h(n) and f(n)) for new partial tour
            If (new NODE.cost < BEST.cost) and (NODE.vector is a tour)
                Increment NODE.level to number of cities
                BEST = NODE
                Broadcast NEW_BEST to all processors
            If (NODE.cost < BEST.cost) and (NODE.vector is not a tour)
                Insert new NODE into OPEN
    END (Loop until all cities have been checked)
END (Loop while (not terminated by DONE message))
Send results to Controller or Host
END TSP Worker with Levels Design


4.5.3 Distributed List with Load Balancing and NODE Distribution Assuming that the
sequential algorithm expands the fewest nodes, what is the best way to have parallel algorithms
emulate the same ordering of nodes to be expanded? The centralized list algorithm very closely
emulates the sequential algorithm, but its efficiency is limited when scaled to a large number of
processors.

While the distributed list with load balancing ensures all processors are kept busy until all the
nodes have been examined, it does not mean they are all doing productive work. Saletore defines
wasted work as work performed to the right of the solution in the state space. For example, see
Figure 3-2. He assumes a left to right search of the state space. If the solution is state 1, then any
node expanded to the right of state 1 is wasted work [Saletore, 1991: 4]. Felten describes redundant
work as expanded nodes which the sequential algorithm would have eventually pruned off [Felten,
1988: 1503].

Figure 4.6. Initial OPEN Lists

In the sequential or CL algorithms, the NODEs would be ordered in non-decreasing cost so the
lowest cost NODE would always be expanded next. However, with DL algorithms each processor
has its own local list of nodes to expand and a processor could be expanding a node with a much
higher cost than nodes on its neighbors. For example, Figure 4.6 shows the state of the OPEN lists
at a certain point in time. Processors 0 and 1 have approximately equal cost NODEs at the head
of the list while processors 2 and 3 have much higher cost NODEs. Obviously, we want the lowest
cost nodes to be expanded next.

As discussed in Chapter II, Cvetanovic and Nofsinger propose a method they call Continuous
Diffusion to evenly distribute the lowest cost NODEs. This algorithm periodically distributes a
predetermined number of NODEs from the front of a processor’s OPEN list to its nearest neighbors.
Using the initial state of Figure 4.6, Figure 4.7 shows the state of the OPEN lists after processor 0
distributes 1 NODE to its nearest neighbors, processors 1 and 2. Figure 4.8 shows the state of the
OPEN lists after processor 1 distributes NODEs. Notice how the fronts of the OPEN lists are much
more uniform in cost. Figure 4.9 shows that if a processor with high NODE costs distributes, the
NODEs sent are inserted farther down in the OPEN list [Cvetanovic and Nofsinger, 1990: 86-90].

Instead of distributing to its nearest neighbor, Felten suggests randomly distributing the
NODEs [Felten, 1988: 1502]. One problem with this approach is that there is no systematic way to
distribute the lowest cost NODEs to other processors. A processor might never receive distributed
NODEs and so keep much higher cost NODEs on its OPEN list.

One problem with these approaches is determining how often the processors should distribute
NODEs. Cvetanovic and Nofsinger define δ as the number of nodes a processor expands before
distributing from its OPEN list. The conflicting goals of minimizing extra search and load imbalance
versus communication overhead determine the optimal value for δ [Cvetanovic and Nofsinger, 1990:
86-90]. Determining guidelines for setting the value of δ is one goal of this research.

Another problem with distributing NODEs is that a large number of NODEs with the same
cost are generated. For example, on one run using 8 processors and 100 cities, every processor had
at least 1000 NODEs with a cost of 134 at the front of the OPEN list. In this situation, it makes
no sense to distribute, since all the processors have NODEs with the same cost on OPEN.

To implement the distributed list with load balancing and NODE distribution algorithm, the
distributed list with load balancing algorithm is modified to distribute NODEs. Additional control
is added to the main algorithm to determine when to distribute. Also, two more functions are
provided to perform the actual distribution of the NODEs.

The first function, distribute, determines if there is enough work to distribute. This function
has the same concerns about load balancing and communication overhead as the share_work
function. Therefore the same guidelines used for share_work are used in distribute. One difference is
that distribute sends at most 2 NODEs from its OPEN list to any neighbor. The algorithm for
distribute is:

distribute
    If (there is enough work to distribute) send NODEs to nearest neighbors
END distribute


The other function, receive_dist, receives the distributed NODEs and inserts them into the
OPEN list. The algorithm is:

receive_dist
    Receive NODEs from neighbor
    Insert NODEs into OPEN list
END receive_dist
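Because OPEN is kept in non-decreasing cost order, each received NODE must be placed at its sorted position. The C sketch below illustrates this with a plain array of costs. It is an assumption-laden simplification: the thesis NODE structure carries a link field, so the real list is probably linked, and the helper name open_insert and the array layout are hypothetical.

```c
#include <assert.h>

/* Insert one cost into open[0..n-1], which is sorted in
 * non-decreasing order. The caller guarantees room for n+1
 * entries. Returns the index where the cost was placed. */
int open_insert(int open[], int n, int cost)
{
    int i = n;
    assert(n >= 0);
    while (i > 0 && open[i - 1] > cost) {
        open[i] = open[i - 1];   /* shift costlier entries toward the back */
        i--;
    }
    open[i] = cost;
    return i;
}

/* Sketch of receive_dist: merge a batch received from a neighbor.
 * Returns the new size of the OPEN list. */
int receive_dist(int open[], int n, const int received[], int count)
{
    int k;
    for (k = 0; k < count; k++) {
        open_insert(open, n, received[k]);
        n++;
    }
    return n;
}
```

A high cost NODE received from a neighbor ends up far from the front, which is the behavior Figure 4.9 describes.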


Additional variables are needed to help control when to distribute NODEs. SC represents
the number of nodes expanded since the last time NODEs were distributed and δ is the minimum
number of nodes a processor must expand before distributing again. The distributed list with
load balancing and NODE distribution algorithm is:


TSP Worker With Load Balancing and Distribution Design
Receive cost matrix from host
Perform depth first search on one node to determine initial BEST.cost
Generate starting node in search tree
Distribute children of initial node among all processors
While (not terminated by DONE message)
    While (OPEN not EMPTY)
        Remove NODE from OPEN queue
        Check for RING
        If (SC > δ) then distribute NODEs to nearest neighbors
        Check for distributed NODEs from neighbors
        Check for WORK_REQUEST message from another processor
        Loop until all cities have been checked
            Add a city to end of cities in NODE.vector which have been visited
            If new city has not been visited in this partial tour
                Calculate the cost (h(n) and f(n)) for new partial tour
                If (new NODE.cost < BEST.cost) and (NODE.vector is a tour)
                    BEST = NODE
                    Broadcast NEW_BEST to all processors
                If (NODE.cost < BEST.cost) and (NODE.vector is not a tour)
                    Insert new NODE into OPEN
        END (Loop until all cities have been checked)
    END While (OPEN not EMPTY)
    Send WORK_REQUEST to other processors
    Terminate the processors
END Loop while (not terminated by DONE message)
Send results to Host
END TSP Worker With Load Balancing and Distribution Design
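The distribution trigger in the worker loop can be sketched in C as a small counter. This is an illustrative sketch only: the struct DistControl, the function note_expansion, and the extra min_open check (standing in for the "enough work to distribute" test) are hypothetical names and assumptions, not the thesis implementation.

```c
#include <assert.h>

typedef struct {
    int sc;      /* nodes expanded since the last distribution (SC) */
    int delta;   /* minimum expansions between distributions */
} DistControl;

/* Call once per expanded node. Returns 1 when the worker should
 * distribute now and resets the counter; otherwise returns 0.
 * min_open is an assumed threshold for having enough work. */
int note_expansion(DistControl *c, int open_size, int min_open)
{
    assert(c != 0);
    c->sc++;
    if (c->sc > c->delta && open_size >= min_open) {
        c->sc = 0;
        return 1;
    }
    return 0;
}
```

A larger delta lowers communication overhead at the price of letting the fronts of the OPEN lists drift apart in cost.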


4.6 Summary

This chapter described and provided examples of the data structures used by the algorithms.
The high level design of the TSP algorithm was further developed by adding communication
requirements and architecture specific details. The algorithms for the functions used by the main
algorithms were discussed. Changes to the basic parallel TSP algorithm, including delayed A* and
distributed list queues with and without load balancing, were discussed and the algorithms provided
and explained.

Also discussed were two variations on the basic A* algorithm. The first algorithm, IDA*,
performs a limited depth first search in conjunction with the A* algorithm. The continuous
diffusion algorithm attempts to ensure the nodes being expanded are not wasted work by exchanging
NODEs between processors.

The next chapter provides and discusses the results of all the algorithms.


V. Results


5.0.1 Introduction The previous two chapters presented the design of the sequential TSP
using the A* algorithm. Also presented were parallel algorithms of the TSP using A* including
centralized list, distributed list with and without load balancing, IDA*, and continuous diffusion.
The CL algorithm was developed first and then tested. While this algorithm is efficient for a small
number of processors, the other algorithms were developed to reduce the amount of idle time on
the processors or to reduce the number of states explored by the algorithm.

The purpose of this chapter is to discuss the data gathered while executing these algorithms.
Section 5.2 discusses the metrics used to gather and evaluate the data during testing of the
algorithms. Also provided are the test cases against which the programs were run. Section 5.3 discusses
how to read the test results. Only the results of the programs are presented in this chapter. The
evaluation and interpretation of the results is presented in Chapter VI.

5.1 Metrics

This section is a further discussion of the metrics presented in Chapter III and provides a
detailed explanation of the metrics used in this research.

When gathering data on a program, it is initially difficult to decide how much and what
type of data to collect. Too much data can swamp someone trying to evaluate it and they might
miss something of importance. Too little data and something of importance might not be reported.
Also, the amount of data collected can have an adverse impact on the performance of the
program. While there was a basic core set of parameters that had to be measured only once, the
frequency of other parameters was determined on a trial and error basis. The first few runs of the
centralized list program produced too much information. A relatively small program took minutes
to run compared to the seconds it took to run after the parameters were tuned.


The specific parameters used to evaluate the TSP A* programs are listed below:


• Total program run time — The total execution time of the program, from initiation to
termination.

• Initiation time — Time spent loading the cost data and initializing the variables.

• Search time — The time spent searching for the solution. This includes communication and
idle time. It is calculated by

Search time = Total time - Initiation time

• Processor efficiency — The ratio of the time a processor was in the search portion of the
program versus the total execution time. The search portion of the algorithm does not
include any idle or communication time.

• Average processor efficiency — The average of the processor efficiencies. One or two processors
could have a low efficiency, but the overall efficiency of the program could still be high.

• States expanded per processor — The number of states expanded per processor. This metric
can indicate idle time or inefficiencies in distributing the work.

• Total states expanded — The total number of states expanded. This is one of the best metrics
for comparing search algorithms.
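The timing metrics in the list above reduce to simple arithmetic, sketched here in C. The function names and the use of double values in a common time unit are assumptions made for illustration; how the thesis programs actually recorded these times is not shown in this section.

```c
#include <assert.h>

/* Search time = total time - initiation time (same units). */
double search_time(double total, double initiation)
{
    assert(initiation >= 0.0 && total >= initiation);
    return total - initiation;
}

/* Efficiency of one processor: time spent expanding nodes
 * (no idle or communication time) over total execution time. */
double processor_efficiency(double busy, double total)
{
    assert(total > 0.0 && busy >= 0.0 && busy <= total);
    return busy / total;
}

/* Mean of the per-processor efficiencies. */
double average_efficiency(const double eff[], int nprocs)
{
    double sum = 0.0;
    int i;
    assert(nprocs > 0);
    for (i = 0; i < nprocs; i++)
        sum += eff[i];
    return sum / nprocs;
}
```

Because the average is taken over all processors, one or two low values are diluted, which matches the remark in the list that overall efficiency can remain high.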

Other parameters that were not used to evaluate the algorithms but were used for
troubleshooting purposes were:

• NEW_BEST — A printed message indicating a new global best solution was found and by
which processor.

• Queue size — A printed message showing the size of the OPEN list on the Workers or the
centralized OPEN list on the Controller. Also printed was the cost of the next NODE to be
expanded. This parameter was very helpful in determining the effectiveness of the prune
function.

These parameters also provided a feel for how the programs were running. Several times
problems were detected just by the program not acting as it had in the past.

The IDA* algorithm had the following unique parameters:

• IDA_NEW_BEST — A printed message indicating a new global best solution was found
during the IDA* portion of the algorithm and by which processor.

• IDA_expanded — Number of states expanded during the IDA* portion of the algorithm.

Parameters unique to the distributed list programs are:

• Distributed — The number of times a processor distributed work to its nearest neighbors.

• Asked_for_work — The number of times a processor became idle and requested work from
another processor.

• Share — The variable used to adjust when a processor had enough work to share with another
processor. This variable was used in both the “share_work” and the “distribute” functions.

• Lambda — The variable used to determine when a processor should distribute NODEs to its
nearest neighbors.

5.2 Testing

The main measure of an algorithm is that it produces the correct results. As the example
in Chapter I showed, even relatively small problems can have a prohibitively large number of
combinations which must be checked. The method used to validate the program results was to run
the program and use the results as the “best” solution. With this solution as the bound, problems
with 4, 10, and 22 cities were then solved by hand. For example, Figure 5.1 shows the configuration
for the 4 city input. Figure 5.2 shows the search graph generated by hand to solve this problem.


Figure 5.1. TSP for 4 Cities using File n4a


While the 4 and 10 city problems were relatively easy to solve, the 22 city problem was
selected because it was about the largest problem solvable by hand in a reasonable amount of
time. Even having the best bound possible, i.e., the solution, this problem took over 4 hours to
solve.

In each case, the solution generated by the program and the solution generated by hand had
the same best cost. Since it is impossible to check problems of any significant size by hand and the same test
cases were run using different algorithms and parameters, the assumption is made that the correct
answer is produced if it matches the answer given by different algorithms running the same test
case.

Figure 5.2. Search Graph for 4 Cities using File n4a


The distributed list with load balancing and the continuous diffusion algorithms required
special testing because they have parameters which must be tuned for each application. Therefore
there are several different runs of the same program with the parameters changed. A listing of the
test case inputs and solutions for all the problems is provided in Appendix C.

Each test case was run using all the algorithms described in Chapter IV on 2, 4, 8, 16, and 32
processors. Since the Air Force Institute of Technology (AFIT) has only an 8 processor hypercube,
another hypercube with 64 processors was located at Oak Ridge National Laboratory, Oak Ridge,
TN. While this hypercube had enough processors for the tests, other problems arose. First, the
network used to connect to the Oak Ridge hypercube, the Defense Data Network (DDN), had routing
problems. The software routines to route the telephone connections had been recently modified
and no reliable path between AFIT and Oak Ridge could be found. Response times of up to 10
minutes for each keystroke were noted. This problem was finally partially solved by logging into
a computer at Phillips Lab in Albuquerque, NM, then logging into the Oak Ridge hypercube. This
produced a response time of about 2 seconds, which was acceptable, but the computer at Phillips
Lab was down for maintenance frequently.

Another problem with the Oak Ridge hypercube was that the operating system of the iPSC/2 is not
very robust and can “crash” quite easily. When working with the AFIT iPSC/2, the status of the
computer can be monitored by watching the status lights on the front of the computer. Also, the
system administrator, Richard Norris, was very helpful in determining the cause of the crash and
ways to fix it. Since the iPSC/2 at Oak Ridge was remote, the status lights could not be monitored.

5.3 Test Results

This section provides a listing of the data results and what each table measured. The tables
along with graphs of data from the tables are provided in Appendix B.


• Appendix C — Lists the test cases and the solution to each one. The solution is the order in
which the cities are visited and the associated cost.

• TABLE B1 — Centralized list program test results, including execution time, states expanded,
and average processor efficiency. The entry for one processor is the data for the sequential
algorithm.

• TABLE B2 — Distributed list without load balancing program test results, including
execution time, states expanded, processor efficiency, and average processor efficiency.

• TABLE B3 — Distributed list with load balancing program test results, including execution
time, states expanded, and average processor efficiency. Also included are the number of
times the processor asked for work.

• TABLEs B4 and B5 — Distributed list with load balancing and distributing program test
results, including execution time, states expanded, processor efficiency, and average processor
efficiency. Also included are the number of times the processor asked for work and distributed
NODEs to its nearest neighbors.

• TABLEs B6 and B7 — Distributed list with load balancing program test results, including
execution time, states expanded, processor efficiency, and average processor efficiency. Also
included are the number of times the processor asked for work and distributed NODEs to its
nearest neighbors. This table differs from TABLE B3 in that it shows the optimal range for
the share variable.

• TABLEs B8 and B9 — Distributed list with load balancing and distributing program test
results, including execution time, states expanded, processor efficiency, and average processor
efficiency. Also included are the number of times the processor asked for work and distributed
NODEs to its nearest neighbors. This table differs from TABLEs B4 and B5 in that it shows
the optimal range for the distribute variable.

• TABLE B10 — IDA* program test results, including execution time, states expanded, and
average processor efficiency. Also included are the number of states expanded during the
IDA* portion of the program.

• TABLE B11 — Level program test results, including execution time, states expanded, and
average processor efficiency. Also included are the number of states expanded during the level
portion of the program.

5.4 Summary

This chapter provided the results from all the algorithms developed and tested. The metrics
used to evaluate and compare the algorithms were also discussed. Due to limitations in time and
the adverse impact on program execution, data on all possible metrics were not collected.

The sequential version of the TSP using A* was shown to return the correct results on known
test cases and problems of small enough size to be evaluated by hand. Therefore, the results of
the sequential program are assumed correct. No parallel version of the algorithms returned a cost
different from that of the sequential program.

The next chapter evaluates the results from the different algorithms. Conclusions about the
efficiency of the algorithms are drawn and recommendations for further work are presented.


VI.  Conclusions  and  Recommendations  for  Further  Work 


6.1 Introduction

Chapter I describes the nature of NP-complete problems and provides an example of the
exponential nature of the time and polynomial nature of the space requirements of these problems.
Also described are some of the physical constraints and limits on the capabilities of sequential
computers. These limitations, coupled with the increasing power and decreasing cost per million
instructions per second (MIPS), have led to the use of parallel computers for large, complex problems.

Chapter II provides the background for this research investigation. A more detailed definition
of NP-complete is given along with the relationship between NP-complete, P-time, NP-time,
P-space, and NP-space. Parallel computers in general and the hypercube specifically are discussed.
Some of the problems related to parallel programming of search algorithms are listed and briefly
explained. Finally, different search techniques are described. Which technique or combination of
techniques to use is problem specific and Table 2.2 provides a general listing of the strengths and
weaknesses of each algorithm.

More specific background information relating to metrics used to evaluate parallel algorithms
and specifically parallel search techniques is discussed in Chapter III. Metrics such as processor idle
time, program run time, and number of states expanded were chosen for their ability to measure
and show the relative efficiency of the programs. The assignment problem and how it relates to
the implementation of the traveling salesman problem (TSP) is discussed and an example of its use
given. Finally, the high level design of the parallel A* TSP algorithm using a centralized list (CL)
is discussed. Pseudo code for the algorithm is provided along with a discussion of each part of the
algorithm.

Chapter IV discusses the low level design of the parallel A* TSP algorithms. High
communication overhead and poor scalability due to the master processor becoming a communication
bottleneck in the CL version of the TSP algorithm led to the development of distributed list (DL)
versions of the parallel TSP algorithm. When to share or distribute work are the main areas of
study for these algorithms. Also, an IDA* version of the TSP algorithm is developed. For each of
these algorithms, the pseudo code is given and a detailed discussion of the algorithm is provided.

Chapter V presents the results from each algorithm, including which metrics are used for
evaluating the use of the synthesized program and the testing process. A listing and explanation
of how the data is displayed is also provided. Interpretation of the results is left for this chapter.

This chapter has two main objectives. First, the data from each of the algorithms is
interpreted and compared to the other algorithms. Specifically, the three main metrics of program run
time, states expanded, and processor idle time are compared and explanations for the differences
are provided. Graphs of applicable data are provided to substantiate the explanations for the
differences among the algorithms.

During any time limited research effort, it is never possible to explore all avenues of interest
or problem areas. Therefore, the other objective of this chapter is to provide recommendations for
further study.

6.2 Interpretation of the Results

In this section, the results from the four main algorithms investigated by this research are
discussed. These algorithms are:

• Centralized list A* TSP

• Distributed list with no load balancing A* TSP

• Distributed list with load balancing A* TSP

• Distributed list with load balancing and work distribution A* TSP

As Abdelrahman and Mudge [Abdelrahman and Mudge, 1988: 1497], Quinn [Quinn, 1990: 985-987],
and others have shown, the master processor in a worker/manager functional decomposition


using a centralized list can become a communications bottleneck. They also state that after a
certain number of processors, increasing the number of processors actually increases the execution
time due to this bottleneck. For the Intel iPSC/2 hypercube, the “magic” number of processors is
about 16 and this is validated by this research. Therefore, the algorithms are compared twice, once
using 16 or fewer processors, called small scale computers, and once using more than 16 processors,
called large scale computers.

6.2.1 Preliminary Depth First Search (DFS) The high level design for the Control CL
version of the parallel TSP algorithm performs an initial DFS of one node. This is done by traversing
all cities in numerical order. For example, a 10 city problem initial solution is
1-2-3-4-5-6-7-8-9-10-1. This arbitrary tour provides an initial best solution used to bound the search
process. Even for the 100 city problem, this DFS took less than 1 millisecond.

Initially, using the DFS appeared to have no effect on the time of the search or the number of
states expanded. However, it was found that on some problems the run time and number of states
expanded were greatly reduced. For example, the 65 city problem in file “n65a” expands 1697
states and takes 4153 seconds to run using the DFS on 4 processors. Without the DFS, it expands
3509 states in 8653 seconds. Obviously, the possibility of wasting less than 1 millisecond is worth
the large gains the DFS might provide. Therefore, DFS is used with all algorithms developed for
this research.
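
As a minimal sketch of the initial bound described above, the following Python function computes the cost of the arbitrary tour that visits the cities in numerical order and returns to city 1. The function name `initial_tour_bound` and the small cost matrix are illustrative assumptions and are not taken from the thesis code.

```python
def initial_tour_bound(cost):
    """Return (tour, total_cost) for the tour 1-2-...-n-1.

    `cost` is an n x n matrix where cost[i][j] is the cost of
    travelling from city i+1 to city j+1 (cities are numbered from 1).
    """
    n = len(cost)
    tour = list(range(1, n + 1)) + [1]  # 1-2-...-n-1
    total = sum(cost[a - 1][b - 1] for a, b in zip(tour, tour[1:]))
    return tour, total


# Hypothetical 4-city cost matrix, used only for illustration.
example_cost = [
    [0, 3, 9, 4],
    [3, 0, 2, 8],
    [9, 2, 0, 5],
    [4, 8, 5, 0],
]
tour, bound = initial_tour_bound(example_cost)
```

Any partial tour whose estimated cost already exceeds `bound` can be pruned, which is why even an arbitrary starting tour can reduce the number of states expanded.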

Another variation not investigated by this research is to have each processor perform a different
DFS and use the best solution as the initial bound. Also, each processor could perform several
depth first searches before continuing on with program execution. The tradeoff between time spent
in DFS versus the time saved by finding a “good” initial bound needs to be investigated.

6.2.2 Evaluation of the Algorithms  In evaluating any algorithm, two main metrics are used.
First, an algorithm must be effective in that it provides the correct answer. Secondly, an algorithm
is evaluated as to its efficiency in both time and memory requirements. This research does not
investigate methods to improve or compare memory usage.

The  three  main  efficiency  metrics  used  in  this  research  to  evaluate  an  algorithm’s  execution 
time  efficiency  are  total  execution  time,  number  of  states  expanded,  and  processor  idle  time.  Of 
these,  the  most  important  is  execution  time.  For  non-research  problems,  the  bottom  line  is  providing 
a  correct  solution  in  the  shortest  time  possible,  or  at  least  in  an  acceptable  time.  So  the  overall 
goal is to find algorithms which perform tasks more quickly. While there is normally a direct
correlation between the number of states expanded, processor idle time, and the execution time of
a search algorithm, states expanded and idle time are still useful metrics. The number of states
expanded can be used to compare different algorithms or the same algorithm run on different computers.
Processor idle time is helpful in showing where a parallel algorithm is inefficient and possibly how
to improve it. These metrics are the basis for comparing the algorithms.

For all algorithms discussed, four different problem sizes are run. These four problems are
representative of the different size problems likely to be encountered and contain 22, 55, 65, and
100 cities. Descriptions of the problems along with the cost matrices and problem solutions are in
Appendix C. Results such as execution time, number of states expanded and processor efficiency
are provided in Appendix B. Some of the graphs of data are also provided in the appropriate
sections for ease in understanding the discussion.

6.2.3 Small Scale Parallel Computers  This section compares the different algorithms using
16 or fewer processors on an Intel iPSC/2 hypercube. Each of the four algorithms is discussed and
compared to the others.

6.2.3.1 Centralized List Algorithm (CL)  The first thing to notice is that for all cases,
the CL algorithm outperformed all others in both execution time and in the number of states
expanded. These are the same results obtained by Abderlrahman and Mudge [Abderlrahman and
Mudge, 1988: 1497], Quinn [Quinn, 1990: 385-387], and many others. This is because in small
scale computers, the master processor does not become a communication bottleneck. This allows
for relatively efficient load balancing of the processors and reduces the processor idle time. Also,
since the OPEN list is kept on a single processor, the order in which the states are expanded closely
resembles the sequential algorithm. The graphs in Appendix B show the number of states expanded
per processor is almost a horizontal line for the CL algorithm. As discussed in Chapter III, this
means relatively few states not expanded by the sequential algorithm are expanded, resulting in
little wasted work.

6.2.3.2 Distributed List With No Load Balancing (DL_NLB)  The distributed list with
no load balancing (DL_NLB) algorithm performed the worst of all the algorithms in relation to
execution time and number of states expanded. Again this conforms with what others have found
[Abderlrahman and Mudge, 1988: 1497] [Quinn, 1990: 385-387] [Felten, 1988: 1501] [Hayes and
Mudge, 1989: 1839] [Cvetanovic and Nofsinger, 1990: 86-89]. If it were known along which path a
solution could be found, only that path would be explored. But since this is not a greedy algorithm,
all paths must be explored implicitly or explicitly. Also, it cannot be determined which path will
generate the most work, so there is no way to evenly allocate the work load at the beginning of
the algorithm. Because the DL_NLB algorithm only divides the work once at the beginning of the
algorithm and never balances the load again, once a processor finishes its assigned work, it sits idle
until all other processors finish. Therefore, the DL_NLB execution time is the time of the longest
path to a solution or bound of the search.

In a DL_NLB algorithm, the only way speedup can occur is by dividing the work among
enough processors so a “good” boundary solution is quickly found. This allows the algorithm to
prune or eliminate numerous search paths. As the figures in Appendix B show, there are initial
drops in the execution times, but then the times are almost constant. Adding more processors does
not noticeably decrease the execution time because the run time limit is the time of the longest
solution or bound. Adding more processors to solve the problem reaches the longest path more quickly,
but does not help decrease the time to explore the path.
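
The effect of dividing the work only once can be sketched as follows. This is an illustrative model under stated assumptions, not the thesis implementation: the function name and the per-processor work times are hypothetical, and communication costs are ignored.

```python
def static_partition_run_time(work_per_processor):
    """Model a run in which work is divided once and never rebalanced.

    `work_per_processor` holds the time each processor needs to finish
    its own share. The run ends when the slowest processor finishes;
    every other processor sits idle for the difference.
    """
    run_time = max(work_per_processor)
    idle = [run_time - w for w in work_per_processor]
    return run_time, idle


# Hypothetical per-processor search times in seconds.
run_time, idle = static_partition_run_time([120, 45, 300, 80])
```

In this model the run time is set entirely by the processor holding the longest path, so adding processors helps only if it shortens that one path, for example by finding a better bound earlier.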

6.2.3.3 Distributed List With Load Balancing (DL_LB)  As discussed in Chapter II,
one  way  to  reduce  the  algorithm  execution  time  when  using  a  distributed  list  algorithm  is  to  allow 
processors  to  request  work  from  another  processor  when  they  become  idle.  One  problem  with  this 
approach  is  the  overhead  associated  with  determining  if  a  processor  has  enough  work  to  share  and 
the  communication  to  pass  work  between  processors.  These  problems  are  discussed  in  greater  detail 
in  section  6.2.5. 

Figures B1, B2, B3, and B4 show the problem size has a real impact on the execution time of
the DL_LB. For example, the 22 city problem actually increased in execution time going from 2 to
4 processors then decreased again from 4 to 8 processors. The 22 city problem at times ran slower
than the DL_NLB algorithm and never dramatically decreased the execution time. The 55 city
problem also ran approximately the same amount of time as the DL_NLB algorithm. But looking
at the 65 and 100 city problems, you notice dramatic decreases in execution times. For example in
the 100 city problem using 4 processors, the times decreased from 33504 to 20117 seconds. This is
a decrease of 40%!

This  discrepancy  between  small  and  large  sized  problems  is  explained  by  the  communication 
and  task  granularities  of  the  different  problems.  As  stated  in  Chapter  II,  granularity  is  a  measure  of 
relative  size  or  frequency  of  an  event  or  computation.  In  the  small  problems,  the  amount  of  overhead 
associated with sharing work far outweighs the gain provided by not having idle processors. The
problems are so small that it is more efficient to solve them using the CL algorithm or even a single
processor.

The two large problems are more representative of the size of problems that would be solved
using a parallel computer. For these problems, the DL_LB algorithm performs better than the
DL_NLB algorithm but not as well as the CL algorithm. The idle time associated with the
DL_NLB algorithm is greatly reduced, but the overhead associated with load balancing still makes
this algorithm less efficient than the CL algorithm. The number of states expanded by this algorithm
is also much higher than for the CL algorithm. This is due to each processor having its own local
OPEN list. This means only a local, not global, best state is selected for expansion, resulting in
wasted work.

6.2.3.4 Distributed List with Load Balancing and Distribution (DL_DIST)  The DL_DIST
algorithm is identical to the DL_LB algorithm with the addition of a function to distribute work
from its OPEN list to its nearest neighbors. This is an attempt to emulate the global OPEN
list of the CL algorithm and reduce the number of states expanded. For the small problems, the
DL_DIST performance is better than the DL_LB algorithm. For the larger problems, the performance
is mixed. In the 65 city problem, both the execution times and number of states expanded
are better for the DL_DIST algorithm than the DL_LB algorithm. However, in the 100 city problem,
the DL_LB algorithm executes in less time until about 7 processors are used. Using between
7 and 16 processors the DL_DIST algorithm performs better. In both cases, the differences are not
very great. Again, the DL_LB and DL_DIST algorithms are discussed in section 6.2.5.

6.2.3.5 Small Scale Parallel Computer Summary  For the small scale computer, the
CL algorithm performed better than the other algorithms. However, when using 8 processors or
more, the master processor begins to become a bottleneck and the execution time becomes almost
constant. Increasing the number of processors does not decrease the execution time. The DL_NLB
algorithm performed the worst of all the algorithms due to load imbalances resulting in processor
idle time. Neither the DL_LB nor the DL_DIST algorithm performed as well as the CL algorithm,
but at 16 processors the execution time curves are beginning to approach the CL execution time
curve. When using either the DL_LB or DL_DIST algorithms, adding processors greatly decreased
the execution times. The DL_LB and DL_DIST algorithms are discussed in more detail in section
6.2.5.

6.2.4 Large Scale Parallel Computers  This section discusses the behavior and performance
of the algorithms using 16 or more processors to solve the problems. Because of hardware problems
with the Oak Ridge National Laboratory’s iPSC/2, not all algorithms could be run using 32
processors for all problem sizes. However, enough data was collected to show the overall trends of
each algorithm.

6.2.4.1 Centralized List  As discussed earlier, the CL algorithm’s main deficiency is
that the master processor becomes a communications bottleneck forcing slave processors to remain
idle waiting for new states to expand. This trend is painfully obvious when using more than 16
processors. The execution time curves for all problem sizes had already begun to flatten out when
using between 8 and 16 processors. Increasing the number of processors provided no noticeable
decrease in the execution times for the problems, and in the case of 100 cities the execution time
began to increase.

One important factor to note when comparing the different algorithms is the processor efficiency
versus the number of states expanded. The CL algorithm always expanded the fewest
number of states, but its efficiency really drops off as more processors are used to solve the problem.
The distributed list algorithms expand many more states than the CL algorithm and their
processor efficiencies are relatively high. For example, in the 100 city problem, the CL algorithm’s
efficiency decreases from 0.938 to 0.422 when using 4 and 16 processors respectively. However, the
DL_DIST algorithm’s efficiency only drops from 0.997 to 0.988. Since the CL algorithm’s execution
time is always less than the DL_DIST algorithm’s, this implies one of the most important factors
for reducing search algorithm execution time is not processor efficiency, but the heuristic used to
determine the order in which the search graph is explored!
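
To make the distinction between efficiency and run time concrete, the sketch below treats processor efficiency as the average fraction of the run time that processors spend busy rather than idle. This reading is an assumption made for illustration, since the definition is given elsewhere in the thesis; the function name and sample values are hypothetical.

```python
def processor_efficiency(run_time, idle_times):
    """Average fraction of the run time that the processors were busy.

    `idle_times` holds one idle-time total per processor, in the same
    units as `run_time`.
    """
    p = len(idle_times)
    busy = sum(run_time - idle for idle in idle_times)
    return busy / (p * run_time)


# Hypothetical run: 100 seconds on 4 processors with some idle time.
eff = processor_efficiency(100, [0, 10, 20, 30])
```

Under this reading, an algorithm can keep every processor busy, and so score close to 1.0, while still taking longer overall if much of that busy time is spent expanding states a better-ordered search would have pruned.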

6.2.4.2 Distributed List With No Load Balancing  For the same reasons discussed in
the small scale computer section, the DL_NLB algorithm execution time curve is almost a constant
when using more than 16 processors. Like the CL algorithm, increasing the number of processors
does not decrease the execution time. Also like the CL algorithm, the 100 city problem execution
time increased when going from 16 to 32 processors. The minor changes required to add load
balancing to this algorithm are definitely worth the cost.

6.2.4.3 Distributed List With Load Balancing and Distribution  This section discusses
both the DL_LB and DL_DIST algorithms. A comparison of these algorithms is provided in section
6.2.5. While only the 100 city and 65 city DL_LB problems could be run using more than
16 processors, the trends are obvious. The execution time curves had already begun to flatten
out when using fewer than 16 processors. When using more than 16 processors this trend continues
and the curves flatten even more. Processor idle time has increased due to the overhead and
communication associated with load balancing. As the number of processors increases, the more
likely it is for a processor to request work from an idle processor or one that does not have enough
work to share. However, since the CL algorithm execution times have become almost constant, the
DL_LB and DL_DIST algorithms’ execution times are approaching the CL times. For the 100 city
problem, the difference between the DL_DIST and CL algorithm is only 264 seconds or 9%. This
much variance in execution time was noted when running the exact same problem repeatedly.

The number of states expanded by the DL_LB and DL_DIST algorithms increases dramatically
with the number of processors used. This is due to the OPEN list being local to each processor and
not global. As discussed in Chapter III, this results in expanding states which are not expanded
when using a global OPEN list. This tradeoff of wasted work versus reducing the execution time is
acceptable, especially since the distributed list algorithms are scalable to large numbers of processors.

6.2.4.4 Algorithm Summary  For all algorithms and problem sizes, the CL algorithm
still has the shortest execution time. However, as the number of processors increases above 16, the
DL_LB and DL_DIST algorithms’ execution times continue to decrease while the CL execution time
is almost constant. At approximately 32 processors the execution times of the CL and both DL
algorithms meet. After this point, the DL algorithms should be more efficient, but no computer was
available to validate the continuation of the algorithm execution time curves. Using the number of
states expanded as the metric, the CL algorithm again is the best.

6.2.5 Comparison of DL_LB and DL_DIST Algorithms  Many articles discussed distributed
list load balancing and distributing nodes from the OPEN list in an attempt to emulate a global list.
However, the only guidance provided by any article was to state that determining when and how
to balance the load was very difficult and each algorithm or problem was unique [Abderlrahman
and Mudge, 1988: 1495] [Quinn, 1990: 385-387] [Felten, 1988: 1496] [Hayes and Mudge, 1989: 1836]
[Cvetanovic and Nofsinger, 1990: 86-89]. This section discusses the differences between the DL_LB
and DL_DIST algorithms and provides some general guidelines and observations about when and
how to balance the work load when using a distributed list algorithm. Most of the data gathered
was using 16 or fewer processors because of the hardware problems with the Oak Ridge National
Laboratory’s iPSC/2 hypercube.

Since there are two variations of the algorithm investigated by this research, there are four
combinations of the variations. The combinations are:

• Distributed list with load balancing (DL_LB): The term share is used to describe passing
work from one processor to another for the purpose of balancing the work loads. Sharing
work is only initiated when a processor is idle.

• Distributed list with distribution but no load balancing (DL_LB): The term distribute is
used to describe passing work from one processor to another for the purpose of emulating a
global OPEN list. Distributing work can be done any time during program execution.

• Distributed list with load balancing and distribution (DL_DIST): This algorithm uses both
load balancing and distribution. This algorithm also encompasses the fourth variation of
using distribution and load balancing.

The distributed list with distribution and no load balancing is very similar to the distributed
list with no load balancing. The differences are examined when the DL_LB and DL_DIST algorithms
are discussed.

6.2.5.1 Load Balancing  As stated in Chapter IV, there are two aspects to sharing
work. First, a processor must be idle and second, another processor must have enough work to
share. The variable share is used to determine the number of nodes required on a processor’s OPEN
list before it can share work with another processor. As stated in Chapter II, there are tradeoffs
to consider when determining when and how to share work. Share work too often and the
communication overhead negates any advantage from sharing; do not share enough and processors
remain idle!
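
The role of the share variable can be sketched as a simple threshold test on the donor's OPEN list. The code below is a minimal illustration under stated assumptions: the function name is hypothetical, and the policy of giving away every other node is an assumption chosen for the example, since this section does not state how much work is passed on each request.

```python
def handle_work_request(open_list, share):
    """Answer an idle processor's request for work.

    `open_list` is a list of (cost, state) pairs sorted best first.
    Work is shared only when the list holds at least `share` nodes.
    Returns (nodes_to_send, remaining_open_list).
    """
    if len(open_list) < share:
        return [], open_list  # not enough work to share
    # Assumed policy: alternate nodes so both sides keep good states.
    to_send = open_list[1::2]
    remaining = open_list[0::2]
    return to_send, remaining


# Hypothetical OPEN list holding four partial tours.
open_list = [(10, "a"), (12, "b"), (15, "c"), (20, "d")]
sent, kept = handle_work_request(open_list, share=4)
refused, unchanged = handle_work_request(open_list, share=5)
```

A low threshold lets a donor give away work it will need almost immediately, while a high threshold leaves requesters idle, which is the tradeoff described above.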

Looking at the data in Appendix B Tables B6 and B7, the first thing to observe is the two
smaller problems have the same share variable and the two larger problems have the same share
variable. This indicates the problem size has an effect on how often to share. The larger problems
require more time to determine whether the next city added to the partial tour is already in the
path, to check whether the cities now form a tour, and to calculate the estimated cost to complete
the tour. This means there is a relatively large amount of computation required for the larger
problems. Therefore, once work is shared, it takes a longer period of time before the large problems
would request work again. This allows the share variable to be set smaller, optimizing the ratio of
computation to load balancing overhead. When the share variable is too small, a processor sends
work to a requesting processor and then quickly finishes its own remaining work. Now the processor
is idle and must request work. This cycle continues with relatively few states being expanded and a
relatively large time spent in load balancing. This is similar to thrashing where a processor is
constantly requesting data from secondary memory and little computation is performed. Share
thrashing occurs when load balancing overhead dominates the computations performed to solve
the problem.

Another factor affecting the share variable is the number of children produced by a problem.
In this research, the smaller problems produced fewer children. For example, the 20 city problem
produces at most 20 children while the 100 city problem produces at most 100 children. This
means once a large problem receives work, it is more likely to produce children and need not
request additional work immediately. Problems which generate small amounts of additional work
need to share less frequently, but share larger blocks of work. If a problem generates few children,
share thrashing could occur if the share variable is set too low.

Especially for the large problems, there is a large decrease in the execution times between the
optimal share value and the next lower value. For example, the 100 city problem has an execution
time of 27,838 seconds for a share value of 3 and 9,210 seconds for a share value of 4. When the
share value was set at 2, both the 100 and 65 city problems were terminated after running 24 hours
and did not appear close to finishing. The additional overhead of load balancing should not account
for such a large increase in execution times, especially since the problems require relatively large
computation time and generate a large number of children. What was determined was that while
the computation time required to expand a state remained approximately constant, the number of
children generated did not. This is due to two reasons. First, as Figure 3-10 shows, the number
of children generated decreases at each level of the search tree due to reduced combinations of
solutions available. The second reason is that as the algorithms progress, the cost used to bound
the solutions becomes closer to the optimal solution. This increasingly eliminates children which
are generated from being placed on the OPEN list because their estimated cost to completion is
already higher than the current best solution. After the bounding cost gets relatively close to the
optimal cost, the large problem behaves like a small problem as far as child generation is concerned.
This is why the optimal share variables are so close for both the large and small problems and the
execution times vary so drastically with a small change in the share value.

This situation where the large problems act like small problems lends itself to a graduated
scale approach to selecting the share variable. At the beginning of the program, the share variable
can be relatively low with the value increasing as the bounding cost approaches the optimal cost.
When to increase the share value could be determined by the percent of possible children actually
generated for expansion. The higher the percent, the lower the share value.
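
One possible form of such a graduated scale is sketched below. It is purely illustrative: the thesis proposes the idea but does not give a formula, so the linear interpolation, the function name, and the bounds `min_share` and `max_share` are all assumptions.

```python
def graduated_share(min_share, max_share, children_kept, children_possible):
    """Choose a share value from the fraction of possible children kept.

    A high fraction of children placed on the OPEN list gives a low
    share value; a low fraction gives a high share value.
    """
    if children_possible == 0:
        return max_share
    kept_fraction = children_kept / children_possible
    # Assumed rule: interpolate linearly between the two bounds.
    return round(max_share - kept_fraction * (max_share - min_share))


# Hypothetical bounds of 4 and 10 with three pruning levels.
early = graduated_share(4, 10, 100, 100)   # nothing pruned yet
middle = graduated_share(4, 10, 50, 100)   # half the children pruned
late = graduated_share(4, 10, 0, 100)      # every child pruned
```

The fraction would need to be measured over a recent window of expansions so that the share value tracks the current state of the search rather than its whole history.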

A third factor affecting the choice of a share variable is the processor computational speed. As
the computational speed increases, work should be shared less often. This is because the processors
can quickly expand the states. If the share variable is too low, share thrashing occurs. An example
of this is comparing the iPSC/2 hypercube to the i860 hypercube using identical problem sizes,
algorithms, and share values. As explained in section 6.4, the computational speed is approximately
I  t  times  faster  for  the  ’860.  Using  8  processors,  a  share  value  of  7  and  the  55  city  problem,  the 
iPSC/2  requested  work  243  times  with  a  run  time  of  1103  seconds  while  the  i860  requested  work 
687  times  with  a  run  time  of  172  seconds.  However,  when  the  i860  share  value  was  increased  to 
10, the run time decreased to 141 seconds and work requests dropped to 350. While this is not
conclusive proof, the trend held for all algorithms and share values tested.

6.2.5.2 Distribution  The idea of distributing the workload to achieve greater efficiency
is discussed by Felten [Felten, 1988: 1500-1504], Cvetanovic and Nofsinger [Cvetanovic and
Nofsinger, 1990: 82-90], and many others. Again, the only guidance provided was that each problem is
unique and that the optimal distribute variable must be found by trial and error.

In this research study, the distribute variable determines the frequency of a processor sending
work to its nearest neighbors. A counter is incremented after every expansion of a state. When
the counter equals the distribute variable, work is distributed. Another condition required for
distribution is that the number of nodes on the OPEN list be equal to or greater than the value of
the share variable. This ensures a processor does not distribute work and then have to request
work.

The DL_DIST algorithm has all the advantages and disadvantages of the DL_LB algorithm
with the addition of the distributing of work. Of these algorithms, the DL_LB algorithm had the
most impact on reducing execution time. The DL_DIST algorithm only added a relatively small
amount to the reduction in run times. For example, the execution time for the 65 city problem
using 16 processors and the DL_NLB algorithm decreased from 13762 seconds to 8901 seconds when
the DL_LB algorithm was used. This is a decrease of 35%! However, using distributed list with no
load balancing and distribution, the run time only decreased to 12981 seconds, a decrease of about
5%. This shows the predominant factor in reducing run time is balancing the work load to keep all
the processors busy.

One goal of distributing work is to more closely emulate the global OPEN list of the CL
algorithm. The data in Figures B5, B6, B7, and B8 show the DL_DIST algorithm consistently
expands fewer states than the other two distributed list algorithms. Also, the data in Tables B8
and B9 shows that the number of states expanded is inversely proportional to the frequency of work
being distributed. This shows that the more frequently work is distributed, the more closely the
algorithm emulates the global OPEN list. Unfortunately, the added distributions also incur additional
overhead to process and send the work to other processors. So while the goal of more closely emulating
the global OPEN list was met, the additional overhead requires tradeoffs between the benefits of
distributing work and the overhead incurred.
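
The distribution rule described in this section, a counter compared against the distribute variable plus a minimum OPEN list length given by the share variable, can be sketched as follows. The class and method names are hypothetical, and resetting the counter each time it reaches the distribute value is an assumption, since the text does not say when the counter is cleared.

```python
class Distributor:
    """Decide when a processor should send work to its nearest neighbors."""

    def __init__(self, distribute, share):
        self.distribute = distribute  # expansions between distributions
        self.share = share            # minimum OPEN list length required
        self.counter = 0

    def after_expansion(self, open_list_len):
        """Call after each state expansion; True means distribute now."""
        self.counter += 1
        if self.counter < self.distribute:
            return False
        self.counter = 0  # assumed: restart the count at each check
        return open_list_len >= self.share


# Hypothetical settings: check every 3 expansions, require 4 nodes.
d = Distributor(distribute=3, share=4)
decisions = [d.after_expansion(n) for n in [10, 10, 10, 10, 10, 2]]
```

A smaller distribute value checks more often, which brings the local lists closer to a global ordering at the price of more messages.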

Because the load balancing is the dominant factor in reducing program execution time, few
observations can be made about the distribute variable. First, if the distribute variable is too low,
a problem similar to share thrashing occurred. Little computation was getting done because all the
time was spent in distributing work. The distribute thrashing problem is related to the amount of
computation required to expand the states. Like the share variable, the more computation required
to expand the states, the more frequently work can be distributed.

One consideration not mentioned in the literature is that work should not be distributed if the
cost of the nodes being distributed is constantly the same. For example, using 8 processors and the
100 city problem, the DL_DIST algorithm reaches a point where there are approximately 17,000 nodes
on all OPEN lists. Of these 17,000 nodes, approximately 90% have a cost of 134. Therefore, it
makes no sense to distribute nodes when all the OPEN lists have the same cost. This problem is
reduced by having a counter keep track of the number of times work was distributed with the same
cost. If the counter reaches a predetermined number, no work is distributed until the cost of the
node on the front of the OPEN list changes. For this research, the number was set at approximately
50. This number was selected to allow nodes of the same cost to be spread among all the processors,
but keep nodes from being unnecessarily distributed. While the effect on the algorithm was not
great, it did reduce the 100 city problem using 8 processors from 13867 seconds run time with
5862 states expanded to 13011 seconds run time and 5844 states expanded. This addition to the
DL_DIST algorithm allows problems which generate large numbers of children with the same cost
to efficiently distribute work.
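
The same-cost counter can be sketched as a small guard that is consulted whenever a distribution would otherwise take place. The class and method names are hypothetical, and the sketch assumes that a distribution is carried out every time the guard allows it.

```python
class SameCostGuard:
    """Suppress repeated distribution of nodes that all share one cost."""

    def __init__(self, limit=50):
        self.limit = limit        # distributions allowed at a single cost
        self.last_cost = None
        self.same_cost_count = 0

    def allow(self, front_cost):
        """Return True if work may be distributed at this front cost."""
        if front_cost != self.last_cost:
            # Cost at the front of the OPEN list changed: start over.
            self.last_cost = front_cost
            self.same_cost_count = 0
        if self.same_cost_count >= self.limit:
            return False
        self.same_cost_count += 1
        return True


# Hypothetical run with a small limit so the cutoff is easy to see.
guard = SameCostGuard(limit=3)
allowed = [guard.allow(c) for c in [134, 134, 134, 134, 134, 135]]
```

With the limit of approximately 50 used in this research, nodes of one cost can still spread across the processors before further distribution at that cost is blocked.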

Having  a  large  number  of  nodes  with  the  same  cost  diminishes  the  expected  reduction  in  the 
program  run  time. 

For example, compare the 65 city problem which has few nodes with the same cost and the
100 city problem which has many nodes with the same cost. The most same cost nodes observed
when using 8 processors in the 65 city problem was approximately 1200 nodes with a cost of 137
while the 100 city problem had 17,000 nodes with a cost of 134. In the 65 city problem, the difference
in execution times between the DL_DIST and DL_LB algorithms continues to increase as more
processors are added. In contrast, when using the 100 city problem, the DL_LB algorithm
never clearly outperforms the DL_DIST algorithm. When using large numbers of processors, the
DL_DIST and DL_LB algorithms have almost identical execution times for large numbers of same
cost nodes. This is because by using more processors, the program quickly finds near optimal
solutions and prunes most branches of the search tree. Since there are numerous nodes with the
same cost which are within 1% of the optimal solution, this means the nodes left to expand are
not distributed. Therefore, the function which keeps work from being distributed when the nodes
have the same cost forces the DL_DIST algorithm to emulate the DL_LB algorithm as the number
of nodes with the same cost increases.

6.3  IDA*  Versus  Centralized  List 

One goal of this research is to compare the IDA* algorithm with the DL_DIST algorithm. For
all problem sizes, this implementation of the IDA* algorithm outperformed the DL_DIST algorithm
in both execution time and states expanded. This is because the IDA* algorithm is very similar to
the CL algorithm and states are expanded in approximately the same order as the CL algorithm.
As discussed previously, this is one of the main factors in reducing algorithm execution time and
number of states expanded. Since the IDA* algorithm more closely resembled the CL algorithm,
it is compared to it instead of the DL_DIST algorithm.

Because of the IDA* algorithm implementation, only the number of states expanded can be
compared to other algorithms. The IDA* algorithm consistently expanded more states than the
CL algorithm. This is because the CL algorithm uses the assignment problem as the function
to estimate the remaining cost to completion for that state. As discussed in Chapter III, the
assignment problem can provide a solution to the search problem. This is a form of depth first
search that the IDA* implementation does not exploit. To balance the algorithms for comparison
purposes, the CL algorithm is required to expand all levels of the search graph. This new CL
algorithm is called "level".

Comparing the level and IDA* algorithms produces mixed results. While neither algorithm
always expands the fewest states, the IDA* algorithm does consistently expand fewer. For example,
when using 65 cities, the IDA* algorithm expanded as many as 9649 fewer states, or 480% less! Whether this
is due to the inherent superiority of the IDA* algorithm or caused by the implementations of the
two algorithms could not be determined in this research. Other factors which could affect which
algorithm to use and could be investigated include memory requirements for storing the OPEN list
and the cost of maintaining an OPEN list versus the cost of repeatedly generating and expanding
the states. More study using different implementations of the IDA* and DL_DIST algorithms is
required to determine which algorithm is actually better.
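
The trade-off between storing an OPEN list and repeatedly regenerating states can be seen in a minimal serial sketch of iterative deepening A*. This is the generic textbook form, not the implementation used in this research; the example graph, the zero heuristic, and all names are hypothetical. The sketch keeps only the current path in memory and counts how many times states are expanded across the repeated passes.

```python
import math

# Hypothetical directed graph: node -> list of (neighbor, step cost).
GRAPH = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}


def ida_star(start, goal, neighbors, h):
    """Return (cost, path, expansions) using iterative deepening A*.

    `h` must never overestimate the remaining cost for the result to be optimal.
    """
    expansions = 0

    def search(path, g, bound):
        nonlocal expansions
        node = path[-1]
        f = g + h(node)
        if f > bound:
            return f, None              # report the smallest f that broke the bound
        if node == goal:
            return g, list(path)
        expansions += 1                 # this state is (re)expanded in this pass
        smallest = math.inf
        for nxt, step in neighbors(node):
            if nxt in path:             # avoid cycles along the current path
                continue
            path.append(nxt)
            value, found = search(path, g + step, bound)
            path.pop()
            if found is not None:
                return value, found
            smallest = min(smallest, value)
        return smallest, None

    bound = h(start)
    while True:
        value, found = search([start], 0, bound)
        if found is not None:
            return value, found, expansions
        if value == math.inf:
            return math.inf, None, expansions
        bound = value                   # raise the bound and search again from the root


cost, path, expansions = ida_star("A", "D", lambda n: GRAPH[n], lambda n: 0)
```

No OPEN list is stored, but the expansion count exceeds the number of distinct states on the solution path because each pass restarts from the root.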

6.4  Guidelines  for  Distributed  Memory  Computer  Implementation  of  A*  Algorithms 

This section is a summary of what was learned about implementing A* algorithms on distributed
memory computers using data decomposition. This section is composed of three parts:
determination of whether to use a centralized or distributed list algorithm, factors affecting the use
of a distributed list algorithm, and factors affecting the use of work distribution.

The most important decision is whether to use a centralized list or distributed list to store
the states to be expanded. The following are some general guidelines on which list to use:

•  Is the problem to be scalable to a large number of processors? If the algorithm is to be
scalable to a large number of processors, then use the distributed list. If the algorithm is only
going to be run on a small number of processors, then use a centralized list.

•  Determine the boundary between centralized and distributed list. Sometimes the number of
processors available might be on the borderline between which list is optimal. The choice of
list is then determined by trial. However, the number of processors
effectively controlled using a centralized list varies and is dependent on the
communication/interconnection network and the processor computational speed of the individual
computer. The faster the processor, the larger the number of processors efficiently used with a
centralized list. For the Intel iPSC/2, the centralized list is efficient up to approximately
16 processors, and possibly more efficient than the distributed list until about 32 processors.
The Intel i860 can efficiently control approximately 40 processors using a centralized list with
the computation requiring 40 ms and using the short message protocol [Work, 1991].

The following are guidelines for using a distributed list:

•  Unless  there  is  some  special  attribute  of  the  problem  known  during  algorithm  design  to 
preclude  it,  load  balancing  is  required  to  make  the  algorithm  run  efficiently. 

•  The  number  of  states  awaiting  expansion  on  a  processor  before  it  can  share  work  to  load 
balance  is  dependent  on: 

-  Problem size: States awaiting expansion before allowing load balancing are inversely
proportional to the amount of computation required to expand one state.

-  Children generated: States awaiting expansion before allowing load balancing are
inversely proportional to the number of possible children generated by each state.

-  Graduating scale: Consider making the number of states awaiting expansion on a processor
before it can share work dependent on the stage the program is in. For example, require
a relatively small number of states awaiting expansion at the beginning of the program,
increasing the states required as the program progresses. At the end of the program,
a relatively large number of states is required for a processor to share work by load
balancing.

-  Processor speed: States awaiting expansion before allowing load balancing are directly
proportional to the computational speed of the processor.
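
The four factors above can be combined into a single threshold, as in the following sketch. It is only an illustration of the stated proportionalities: the function name, the parameters, the normalized units, and the linear ramp from 1x to 2x over the run are all assumptions, not values or code from this research.

```python
def share_threshold(base, expand_cost, children_per_state, cpu_speed, progress):
    """Hypothetical minimum number of waiting states before a processor may share.

    base               -- threshold for a reference problem and processor (assumed)
    expand_cost        -- relative computation needed to expand one state
    children_per_state -- relative number of children each state can generate
    cpu_speed          -- relative computational speed of the processor
    progress           -- fraction of the run completed, from 0.0 to 1.0
    """
    if expand_cost <= 0 or children_per_state <= 0 or cpu_speed <= 0:
        raise ValueError("relative factors must be positive")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie between 0.0 and 1.0")
    # Inversely proportional to expansion cost and children generated,
    # directly proportional to processor speed.
    scale = cpu_speed / (expand_cost * children_per_state)
    # Graduating scale: an assumed linear ramp, small early and larger late.
    graduated = 1.0 + progress
    return max(1, round(base * scale * graduated))
```

Doubling `expand_cost` halves the threshold, doubling `cpu_speed` doubles it, and the same processor requires twice as many waiting states at the end of the run as at the start. The actual constants would have to be found by trial for a given problem and machine.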

While the major impact on program execution time is from load balancing, distributing work
can also reduce execution time. Factors affecting distribution are:

•  Numerous states with same cost: Always have some function to keep from distributing work
if the states have the same cost. The overhead of keeping track of the number of
same cost states distributed is minimal compared to the cost of removing and inserting
the states into the OPEN list and transmitting the states between processors.

•  The more computation required to expand a state, the fewer states expanded between work
distributions.


Table 6.1. States expanded by processor using CL and 100 cities


While this research did not investigate methods to increase the efficiency of the CL algorithm,
a couple of observations were noted. First, slave processor efficiency is inversely related to the
frequency of work requests to the master processor. There is a wide variance in the number of
states expanded by each processor in solving a problem. For example, the 100 city problem using
8 processors had the distribution in Table 6.1.

Notice that the lower the processor number, the more states it expanded. While this is dependent
on the manner in which idle processors are selected to send work to, it does show that a few of the
processors are performing most of the work. Processors 1 and 2 expanded 164 out of 355 states,
or 46%! The individual processor efficiencies also reflect the same trend, with processor 1 having
the highest efficiency and processor 7 the lowest. This shows processors are waiting for work to be
assigned to them. Therefore some method of keeping the frequency of work requests low should be
investigated.

Second, the computation speed vs communication speed ratio is a major factor in determining
the number of slave processors a master processor can efficiently control. For example, when
using the Intel iPSC/2 hypercube, the master processor controls approximately 16 slave processors
before the communications bottleneck prevents the efficient addition of more processors
to the problem. However, the Intel i860 hypercube can control an estimated 40 slave processors
[Work, 1991]. While little data was collected using the Intel i860 hypercube during this research,
the data collected does appear to support that a master processor on the i860 hypercube
can efficiently control more processors than the iPSC/2. Since both computers have the same
interconnection/communication system, the only difference is the computational speed of the
processors. Tests performed by Richard Norris, the Air Force Institute of Technology iPSC/2 system
administrator, show the ratio in computational speed between the i860 and the iPSC/2 is about
14:1. The overall ratio of execution times between the i860 and iPSC/2 is about 8:1. The difference
between computation speed and execution times is caused by both computers using the same
interconnection/communication network.
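
The gap between the 14:1 and 8:1 ratios can be made concrete with a rough two-term model. The model is an assumption introduced here for illustration, not an analysis from this research: run time on the slower machine is split into a part that scales with processor speed and a part (such as communication) that does not. The function name is hypothetical.

```python
def unscaled_fraction(speed_ratio, time_ratio):
    """Fraction of the slower machine's run time that does not speed up.

    Assumed model: T_slow = c + m and T_fast = c / speed_ratio + m, with
    T_slow / T_fast = time_ratio. Normalizing T_slow to 1 and solving for m
    gives m = (speed_ratio / time_ratio - 1) / (speed_ratio - 1).
    """
    if speed_ratio <= 1 or not 1 <= time_ratio <= speed_ratio:
        raise ValueError("expected 1 <= time_ratio <= speed_ratio and speed_ratio > 1")
    return (speed_ratio / time_ratio - 1.0) / (speed_ratio - 1.0)


# Ratios quoted in the text: 14:1 computation speed, 8:1 execution time.
m = unscaled_fraction(14.0, 8.0)
print(f"{m:.3f}")  # prints 0.058
```

Under this simplified model, a little under 6% of the iPSC/2 run time behaving as unscaled overhead would be enough to reduce a 14:1 computational advantage to an 8:1 advantage in execution time.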

6.5  Recommendation  for  Further  Research 

In  any  research  effort,  there  is  always  work  you  did  not  have  time  to  perform  and  new  ideas 
that  evolved  but  were  not  explored.  The  following  is  a  list  of  topics  to  further  extend  this  research: 

1.  Investigate a dynamic algorithm that is a combination of the centralized list and distributed list. As
noted earlier, much of the processor idle time was spent waiting for the master processor to provide
work to the slave processor. One way to alleviate this is to have the slave processor send
the low cost children it generates to the master processor and keep a portion of the high cost
children it generates. This reduces communication costs and provides low priority work for the
slave processor while it is waiting for work from the master processor. Another possibility
is to have the slave processors which are consistently waiting for work keep a small portion
of low cost children for expansion while waiting for the master processor to send work. The
processors which consistently wait for work are easily determined as discussed earlier in this
chapter.

2.  Since the centralized list is the most efficient for small numbers of processors, investigate
ways to optimize the use of this list. Many Air Force requirements in the future will require
small parallel computers on board aircraft to perform functions now being performed at the
support base or not being performed at all.

3.  Investigate other methods of distributing work among processors. One method suggested
by Felten is to randomly select processors to distribute work to instead of sending it to the
nearest neighbors [Felten, 1988: 504].

4.  Determine  the  number  of  processors  a  master  processor  can  efficiently  control  on  the  Intel 
i860  RISC  computer. 

5.  Make the termination sequence for the distributed list more efficient. If a processor is still
idle after requesting work from all other processors, it goes into a loop waiting for the RING
to come to it to terminate the process. Sometimes, a processor which did not have enough
work to share keeps generating children and continues to work long after other processors
are idle. After a processor has been idle for a predetermined time, it could force a working
processor to share any of its remaining work.

6.  Investigate more fully the differences between IDA* and the distributed list algorithms.
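
The two target-selection policies contrasted in item 3 can be sketched for a hypercube as follows. On a hypercube, the nearest neighbors of a node are the nodes whose binary labels differ from it in exactly one bit. The function names and the use of a seeded generator are assumptions for illustration; this is not the selection code used in this research.

```python
import random


def hypercube_neighbors(node, dimension):
    """Return the nearest neighbors of `node` in a hypercube of 2**dimension nodes."""
    if not 0 <= node < (1 << dimension):
        raise ValueError("node label is outside the hypercube")
    # Flipping each bit of the label in turn gives one neighbor per dimension.
    return [node ^ (1 << bit) for bit in range(dimension)]


def random_targets(node, dimension, count, rng):
    """Pick `count` distinct processors other than `node`, uniformly at random."""
    others = [p for p in range(1 << dimension) if p != node]
    return rng.sample(others, count)


# A 3-dimensional hypercube has 8 processors, each with 3 nearest neighbors.
neighbors = hypercube_neighbors(5, 3)            # [4, 7, 1]
targets = random_targets(5, 3, 3, random.Random(0))
```

Nearest-neighbor selection keeps messages to a single hop, while random selection may spread work farther across the machine at the price of longer routes. Which effect dominates would have to be measured.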

6.6  Summary 

Because of the combinatoric nature of NP-complete problems, they will continue to be difficult
and time consuming to solve. Many factors encourage the use of parallel computers to solve
these problems. First, parallel computer costs are decreasing while their performance is increasing.
Secondly, these problems have inherent parallelism as demonstrated by the relative ease with
which they are decomposed using data decomposition. Also, it is getting more difficult to increase
performance of sequential computers.

This research investigated NP-complete problems on distributed memory architecture
computers by implementing a traveling salesman problem using a variation of the A* algorithm.
Differences between using a centralized or distributed list to store the states waiting to be expanded
were explored. Three distributed list algorithms, DL_NLB, DL_LB, and DL_DIST, were designed
and implemented. Advantages and disadvantages of each were discussed and compared to the
centralized list algorithm.

The centralized list algorithm is found to perform better than the distributed list algorithms
when using a small number of processors. However, this algorithm produces a communication
bottleneck and is not scalable to a large number of processors. While the number of processors the
master processor can efficiently control is application and computer dependent, some guidelines are
given to help determine this value.

Because of the overhead required for load balancing, the distributed list algorithms are not as
efficient as the centralized list algorithm for small numbers of processors. However, the distributed
list algorithms are scalable to large numbers of processors and become more efficient than the
centralized list algorithm. Factors relating to the efficiency of load balancing and distributing the
work are discussed. Major factors in load balancing and distribution efficiency include the computation
required to expand each state, the number of children generated by each state, the current stage of the
program, and the computational speed of the processors.
Appendix A. Structure Charts


A.1 Introduction

This appendix shows the relationships between the different functions and subroutines of each
algorithm using a structure chart.

A.2 Centralized List Algorithm

The centralized list algorithm consists of three main algorithms, each with subroutines and
functions. The three algorithms are the Host, Control, and Worker algorithms. The IDA* structure
charts are identical to the centralized list charts. The difference in the algorithms is when the
NODEs are inserted into the OPEN list. Their structure charts are:


Figure A.3. Centralized List Worker Structure Chart


A.3 Distributed List Algorithms


The structure charts for each distributed list algorithm are shown below.
Since the host algorithm is the same for all variations of the algorithm, it is only shown with the
distributed list with load balancing.

A.3.1 Distributed List with Load Balancing  The following structure charts are for the
distributed list with load balancing algorithm:


Figure A.4. Distributed List Host Structure Chart


Figure A.5. Distributed List Worker Structure Chart


A.3.2 Distributed List with Load Balancing and Distribution  The following structure charts
are for the distributed list with load balancing and distribution algorithm:


Figure A.6. Distributed List Host Structure Chart


Figure A.7. Distributed List Worker Structure Chart


Appendix  B.  Test  Results  and  Data 


B.1 Introduction

This appendix presents the test data from all the algorithms tested. Each algorithm was
tested using four different size problems: 22, 55, 65, and 100 cities, stored in files n22a, n55a,
n65a, and n100a respectively. The description of the problems is in Appendix C.

The algorithms tested and the abbreviations used are:

1.  Centralized list (CL): shown on charts and tables as tsp

2.  Iterative Deepening A*: IDA*

3.  Centralized list with levels: shown on charts and tables as level

4.  Centralized list with levels: level

5.  Distributed list with no load balancing (DL_NLB): shown on charts and tables as nlb

6.  Distributed list with load balancing (DL_LB): shown on charts and tables as dish

7.  Distributed list with load balancing and distribution (DL_DIST): shown on charts and
tables as dist

B.2 Data

This section presents the data in three different forms. The data is first presented in tables
providing all the pertinent information collected about the algorithms. The data is then presented
in graph form for ease of understanding and to display trends in the data. The CL algorithm was
not run using 2 processors because then there would be the Control processor and only one Worker
processor. This is the same as the sequential algorithm except with the parallel communication
overhead. Also, some algorithms were not run using 32 processors due to hardware problems with
the iPSC/2 computer.


FILE  n22a 


FILE  n55a 


RUN 

DES  TIME 


16.83 


NOT  USE 


6.52 


STATES 

EXPAND 

AVERAGE 

EFFICIENCY 

25 

0.823 

9.493 

44 

0.290 

7.801 

72 

0.270 

6.775 

77 

0.256 

RUN 

DES  TIME 


12341.3 


NOT  USED 


4153.2 


1825.8 


925.1 


FILE  n65a 


RUN TIME    STATES EXPAND   AVERAGE EFFICIENCY
754.57      148             0.996
NOT USED
431.42      157             0.882
219.71      177             0.656
220.996     241             0.487
221.001     288             0.300

FILE n100a

RUN TIME    STATES EXPAND   AVERAGE EFFICIENCY
25724.1     737             0.999
NOT USED
7201.7
2720.5
2671.289
558.588
2671.333
FILE n22a

# NODES   RUN TIME (sec)   NE      EFF
2         29.8             74      0.856
4         27.0             124     0.841
8         27.0             224     0.842
16        24.0             419     0.829
32        24.8             803     0.832

FILE n55a

# NODES   RUN TIME (sec)   NE      EFF
2         1142.8           443     0.988
4         1142.8           739     0.988
8         1142.8           1331    0.988
16        1144.0           2501    0.987
32        1143.1           4115    0.988

FILE n65a

# NODES   RUN TIME (sec)   NE      EFF
2         22485            5017    0.838
4         17921            8866    0.843
8         15534            14971   0.851
16        13762            33025   0.825
32        13015            81132   0.861

FILE n100a

# NODES   RUN TIME (sec)   NE      EFF
2         41151            2810    0.947
4         33504            4103    0.950
8         31145            7113    0.952
16        28971            14892   0.966
32        30655            28463   0.967

NE = total states expanded
EFF = average efficiency per node


Table B.2. Distributed List with no Load Balancing Data


FILE n22a

# NODES   RUN TIME (sec)   TE      SH      EFF
2         25.1             74      7       0.805
4         33.8             193     47      0.820
8         24.1             305     235     0.884
16        23.8             517     469     0.798

FILE n55a

# NODES   RUN TIME (sec)   TE      SH      EFF
2         1142             443     6       0.983
4         1140             783     74      0.993
8         1103             1495    243     0.709
16        1089             1752    701     0.738

FILE n65a

# NODES   RUN TIME (sec)   TE      SH      EFF
2         18846            3033    18      0.943
4         15291            7686    24      0.873
8         11529            14108   89      0.891
16        6394             26714   483     0.894
32        3229             33097   612     0.651

FILE n100a

# NODES   RUN TIME (sec)   TE      SH      EFF
2         27221            1620    6       0.997
4         20117            3069    25      0.997
8         16381            6013    89      0.997
16        9210             11892   320     0.996
32        2995             24493   1184    0.996

TE = total states expanded
SH = number of times work was shared
EFF = average efficiency per node


Table B.3. Distributed List with Load Balancing Data


FILE  n22a 


Table B.5. Distributed List with Load Balancing and Distribution 2 of 2


B.2.1 Execution Time Graphs  This section presents the CL, DL_NLB, DL_LB, and DL_DIST
algorithms' execution time data in graphical form for ease of understanding. The graphs show all
four algorithms for each problem size.


Figure  B.l.  Execution  Time  Data  for  22  Cities 


Figure B.3. Execution Time Data for 65 Cities


Figure  B.4.  Execution  Time  Data  for  100  Cities 


B.2.2 States Expanded Graph  This section graphically presents the number of states
expanded by all algorithms for each problem size.


B.2.3 Share Data  This section provides the data on load balancing. First, the table for
the share variables and then the graphs are provided.


Figure B.9. Execution Time for 22 Cities


Figure B.10. Execution Time for 55 Cities
FILE n22a

N/S     RUN TIME (sec)   STATES EXPAND   AVERAGE EFFICIENCY   DIST   SHARE
16/3    38.7             847             0.546                0      427
16/5    23.8             517             0.798                0      235
16/7    24.5             506             0.783                0      220

FILE n55a

N/S     RUN TIME (sec)   STATES EXPAND   AVERAGE EFFICIENCY   DIST   SHARE
16/4    1213             1740            0.607                0      1034
16/5    1089             1752            0.738                0      701
16/6    1291             1788            0.699                0      683

N/S = # of nodes used/share variable
DIST = # of times program distributed work
SHARE = # of times program shared work


Table B.6. Share Data 1 of 2




FILE n65a

N/S     RUN TIME (sec)   STATES EXPAND   AVERAGE EFFICIENCY   DIST   SHARE
16/3    37295            34089           0.592                0      12382
16/4    6394             26714           0.894                0      483
16/5    6752             28113           0.884                0      494
16/7    6752             28113           0.884                0      494

FILE n100a

N/S     RUN TIME (sec)   STATES EXPAND   AVERAGE EFFICIENCY   DIST   SHARE
16/3    27838            18045           0.639                0      13056
16/4    9210             11861           0.997                0      320
16/5    9662             11804           0.997                0      316
16/7    9662             11804           0.997                0      316

N/S = # of nodes used/share variable
DIST = # of times program distributed work
SHARE = # of times program shared work


Table B.7. Share Data 2 of 2
Figure B.12. States Expanded for 22 Cities


Figure B.13. States Expanded for 55 Cities


Figure B.14. States Expanded for 65 Cities


Figure B.15. States Expanded for 100 Cities


FILE n22a

N/S/D     RUN TIME   STATES EXPAND   AVERAGE EFFICIENCY   DIST   SHARE
16/7/6    13.9       423             0.694                95     400
16/7/7    13.2       424             0.694                94     400
16/7/8    14.0       425             0.621                84     391

FILE n55a

N/S/D     RUN TIME   STATES EXPAND   AVERAGE EFFICIENCY   DIST   SHARE
16/5/6    328        2199            0.800                142    280
16/5/7    297        2193            0.800                145    281
16/5/8    335        2180            0.765                151    274

N/S/D = # of nodes used/share variable/distribute variable
DIST = # of times program distributed work
SHARE = # of times program shared work
FILE n65a

N/S/D     RUN TIME   STATES EXPAND   AVERAGE EFFICIENCY   DIST   SHARE
16/4/5    2852       20029           0.965                89     450
16/4/6    2529       21150           0.984                84     450
16/4/7    2997       26714           0.979                78     450

FILE n100a

N/S/D     RUN TIME   STATES EXPAND   AVERAGE EFFICIENCY   DIST   SHARE
16/4/5    9672       11029           0.943                161    318
16/4/6    8901       11858           0.997                142    320
16/4/7    9513       12026           0.996                140    320

N/S/D = # of nodes used/share variable/distribute variable
DIST = # of times program distributed work
SHARE = # of times program shared work


Table B.9. Distribution Data 2 of 2
Figure B.18. Execution Time for 65 Cities


Figure B.19. Execution Time for 100 Cities


Figure B.20. States Expanded for 22 Cities


Figure B.21. States Expanded for 55 Cities


Figure B.23. States Expanded for 100 Cities


B.3  IDA*  Data 


This  section  presents  the  data  comparing  the  centralized  list  algorithm  against  the  IDA* 
algorithm.  Both  tables  and  graphs  similar  to  those  presented  for  the  distributed  list  algorithms  are 
presented  here. 
Figure B.24. IDA* Execution Time Data
FILE n22a


FILE  n55a 


NODES  TIME 


15.0 


IDA*  EFF 


19  0.850 


TE = total states expanded

IDA* = states expanded during IDA* DFS

EFF = average efficiency per node
RUN TIME   TE     IDA*   EFF
744.6      148    139    0.997
482.6      148    139    0.891
391.6      148    139    0.715
435.4      253    220    0.692
452.9      261    222    0.691

FILE  n65a 


#  RUN 

NODES  TIME 


1682  0.999 


1928  0.987 


1948  0.986 


1955  0.961 


TE  =  total  states  expanded 

IDA*  =  states  expanded  during  IDA*  DFS 

EFF = average efficiency per node
TE  IDA*  EFF

27542  762  728  0.999
7311  345  339  0.941
2915  349  .381
Figure B.25. IDA* States Expanded Data


Figure B.26. Level Execution Time Data


Figure B.27. Level States Expanded Data


Appendix C. Problem Definition and Data


C.1 Introduction


This appendix provides the cost matrices and solutions for the four main problems used in
this research. Each section is divided into two parts: a cost matrix and a solution to the problem.
As explained in Chapter II, the cost of traveling from one city to another is determined by selecting
a city, finding that column, and then finding the intersecting row of the city that you are traveling
to. For example, in the 22 city problem, the cost of traveling from city 4 to city 7 is 53 while the
cost of traveling from city 7 to city 4 is 67. Again notice that the costs are not symmetrical. A cost
of 999 indicates infinite cost.
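
The lookup rule above (column for the city you leave, row for the city you travel to) can be sketched as follows. The 4-city matrix is hypothetical and is not data from this research; the function names are also assumptions. Only the column/row convention, the asymmetric costs, and the use of 999 for infinite cost come from the text.

```python
INFINITE = 999  # value the appendix uses to mark infinite cost

# Hypothetical 4-city cost matrix (not one of the thesis problems).
# Column = city you leave, row = city you travel to.
COST = [
    [999,   7,   3,   9],
    [  4, 999,   6,   2],
    [  8,   5, 999,   1],
    [  6,   3,   7, 999],
]


def travel_cost(matrix, origin, destination):
    """Cost of traveling from `origin` to `destination` (cities numbered from 1)."""
    return matrix[destination - 1][origin - 1]


def tour_cost(matrix, tour):
    """Total cost of a tour given as a list of cities that ends where it starts."""
    total = 0
    for origin, destination in zip(tour, tour[1:]):
        step = travel_cost(matrix, origin, destination)
        if step >= INFINITE:
            return float("inf")  # the tour uses a forbidden link
        total += step
    return total


forward = travel_cost(COST, 1, 2)        # 4
backward = travel_cost(COST, 2, 1)       # 7, the costs are not symmetrical
total = tour_cost(COST, [1, 2, 3, 4, 1])  # 4 + 5 + 7 + 9 = 25
```

Swapping the row and column indices silently gives the cost of the reverse trip, so the indexing convention has to be fixed in one place.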


C.2 Problem n22a

This is the cost matrix for problem n22a:

This is a solution for the n22a problem:

1-6-12-8-20-22-4-11-13-15-21-2-3-5-18-10-17-14-16-19-9-7-1


at  cost  180 
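
A solution listed in this form can be checked mechanically. The sketch below only verifies that a tour starts and ends at the same city and visits every city exactly once; it cannot confirm the quoted cost, since that requires the cost matrix. The function names are hypothetical.

```python
def parse_tour(text):
    """Turn a tour written as '1-6-12-...-1' into a list of city numbers."""
    return [int(part) for part in text.split("-")]


def is_valid_tour(tour, n_cities):
    """True if the tour is closed and visits each of the n cities exactly once."""
    return (
        len(tour) == n_cities + 1
        and tour[0] == tour[-1]
        and sorted(tour[:-1]) == list(range(1, n_cities + 1))
    )


# The n22a solution as listed above.
N22A = "1-6-12-8-20-22-4-11-13-15-21-2-3-5-18-10-17-14-16-19-9-7-1"
ok = is_valid_tour(parse_tour(N22A), 22)
```

The same check applies to the longer solutions once their wrapped lines are joined into a single string.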


C.3 Problem n55a

This  is  the  cost  matrix  for  problem  n55a: 
A solution to the 65 city problem is:

1-60-5-38-14-6-25-58-36-57-59-23-49-63-42-4-8-48-44-52-18 
-27-53-13-7-55-31-21-50-32-30-64-24-10-12-61-35-40-19-37-65 
-9-39-22-45-28-17-62-41-29-56-26-47-43-51-33-20-3-34-11-16 
-2-15-54-46-1 
at  cost  136 


C.5 Problem n100a


This is the cost matrix for problem n100a:

A solution to the 100 city problem is:

1-58-39-78-19-34-28-55-99-61-54-80-62-92-10-100-13-96-21-52-45 

-23-95-65-63-68-70-84-82-66-16-75-86-49-77-69-94-35-3-7-42 

-73-6-91-30-22-37-17-51-15-36-44-48-40-38-43-41-32-88-25-31 

-90-97-83-93-72-12-9-27-50-87-98-79-4-81-67-8-76-59-26-24 

-18-57-47-2-5-71-60-85-20-74-53-29-14-56-46-64-33-89-11-1 